About Network Health Variables



Network Health uses a proprietary technology called the management 
information base (MIB) translation file (MTF). This technology allows Network 
Health to normalize data collected from standard and proprietary agents 
available from multiple vendors. By normalizing data, Network Health can 
reliably analyze data collected from different agents and vendors and display 
it in the same report using a standard set of labels. 



About MIB Translation Files 

For each type of element (such as Ethernet, Frame Relay, Asynchronous 
Transfer Mode (ATM), remote access devices, routers, and servers), 
Network Health assigns a set of variables to columns in the database. To 
make those assignments, it requires an MTF for each element that is to be 
polled at a device. For example, if the Simple Network Management Protocol 
(SNMP) agent at a device supports Ethernet, Fiber Distributed Data Interface 
(FDDI), Token Ring, and Frame Relay, Network Health requires an MTF for 
each element type. 

Each MTF identifies the associated MIB and its filename, an agent for this 
element type, and a set of statements that map MIB variables to the 
appropriate database column. When an element is discovered, Network 
Health assigns the appropriate agent type to it. Network Health polls for data 
for the variables defined in the MTF only. 

With an MTF, Network Health specifies only those MIB attributes used to 
generate reports. Often the information required for analysis exists in either a 
subset of a MIB table or in multiple MIB tables in the agent. Using an MTF, 
Network Health can combine data from different tables in the MIB that are 
indexed in the same way. In addition, an MTF can combine standard MIB 
information with proprietary extensions in a single poll. 

Using the MTF technology, you can request that Network Health poll devices 
for specific variables, define the mapping to database columns for those 
variables, create the type of labels that you want to appear in reports, and run 
Trend reports on your variables. Refer to Chapter 2, "Creating an MTF," for 
instructions. 

About Report Labels 

For each element type, Network Health establishes a set of variable labels. A 
label is associated with each database column based on the element type. 
When you run Network Health, it displays the correct labels in reports and in 
lists for selecting variables on which to run Trend reports. 

After creating an MTF, you can associate your variables with labels provided 
by Network Health or labels that you create. Network Health provides ASCII 
files that you can modify to create your own labels. Refer to Chapter 3, 
"Adding Variable Labels," for instructions. 

Creating an MTF



To create an MTF, you must do the following: 

1. Construct the MTF.

2. Create a compiled MIB. 

3. Add the agent that is assigned to the MTF to Network Health. 

4. Restart the Network Health server. 

This chapter describes each of these steps in detail. 

Constructing the MTF 

An MTF describes the mapping of data from a source (such as a MIB, or data 
imported from a database data information (DDI) file) to the columns in the 
Network Health database. To construct an MTF, you can either edit an 
existing MTF to add your mappings or write a new MTF. 

If you edit an existing MTF, every time you reinstall or upgrade Network 
Health, the installation process copies to the nethealth/changed directory any
MTFs and compiled MIBs that you modified. You must copy them back to the 
poller directory. If you write a new MTF, reinstalling or upgrading Network 
Health does not affect your MTFs or compiled MIBs. 

Network Health supports using indexes to access MIB variables. It only 
collects statistical data that is a counter or a gauge. This section describes 
assigning indexes to variables and how to define your data as either a counter 
or a gauge. 

Writing an MTF

An MTF is an ASCII text file that uses nested statements (enclosed in braces) 
to define the set of attributes to use. It begins with the keyword mib and the 
name of the MIB being translated. Generally, you should use the name of the 
MIB as part of the MTF filename. 

The MTF includes nested statements for the following types of information: 
support, data source, and translation. The following is a sample mib2.mtf file: 

mib mib2
{
    file mib2.mib
    version 2
    agent "MIB2 (wan port)"
    translation
    {
        mediaType = -100
        mediaSpeed = ifSpeed%
        operStatus = ifOperStatus%
        operStatusLastChange = ifLastChange%
        variable1 = ifInUcastPkts + ifInNUcastPkts + ifInErrors + ifInDiscards + ifInUnknownProtos
        variable2 = ifInOctets
        variable3 = ifInNUcastPkts
        variable4 = ifInNUcastPkts + ifOutNUcastPkts
        variable10 = ifInErrors
        variable9 = ifInDiscards
        variable16 = ifInUnknownProtos
        variable22 = ifInUcastPkts + ifInNUcastPkts + ifOutUcastPkts + ifOutNUcastPkts + ifInErrors + ifInDiscards + ifInUnknownProtos
        variable23 = ifInOctets + ifOutOctets
        variable24 = ifInErrors + ifOutErrors
        variable25 = ifInDiscards + ifOutDiscards
    }
}

Note

To create your own MTF, you can copy and rename an existing MTF. 

Support Information

The support information section includes the following variable statements: 

file mibFilename
version number
aggregateOnly value
agent "agentTextString"

Each variable is required and must have the appropriate value, as defined in 
Table 2-1. 



Table 2-1: Support Information

Variable         Definition

file             The filename for the MIB being translated by this MTF.
                 The corresponding filename.pcm file must reside in
                 the poller directory of the Network Health installation.

version          The number of the MTF format. Only the value 2 is
                 supported.

aggregateOnly    Indicates whether the element is a form of parent
                 element that is not polled but exists for reporting and
                 aggregation purposes. The default value is no.

                 If set to yes, this statement indicates that the element is
                 not polled, but it is used to collect aggregate data for
                 children elements. Modem pools are an example of this
                 type of element.

agent            The text for the agent that appears in the Poller
                 Configuration dialog box. You must create a unique
                 string for your MTF.



Data Source Information 

The dataSourceInfo section of an MTF provides information concerning
response elements (response paths). The data source information begins with
the dataSourceInfo keyword followed by an open brace on a new line as
follows:

dataSourceInfo
{



Network Health Customizing Variables 




This section contains three variable statements in the following format:

dataSourceType dataSourceType
presVarListName presVarListName
protocol protocol

Each statement must occupy a single text line. These statements are required 
and must have the appropriate value. 

The first variable, dataSourceType, indicates the kind of data that Network 
Health should expect to collect from the polled device. Using this variable, 
Network Health distinguishes among the protocol measurement capabilities 
of various kinds of routers. For data import purposes, you should set the 
value to NotApplicable. 

The second variable, presVarListName, represents the value of the keyword
in nethealth/poller/protocols.vars, which defines the data fields from which
this MTF file is expected to extract necessary configuration information.
Using this variable, Network Health determines which parameter variables
are applicable to various kinds of routers. For data import purposes, you
should set the value to genericResponsePath.

The last variable in the dataSourceInfo section, protocol, defines the protocol
measured by elements of this type. This variable controls the protocols that
you view in Network Health reports. Table 2-2 lists the valid values for each
supported protocol.



Table 2-2: Supported Protocols for dataSourceInfo

Protocol                      Value

Ping                          ICMP
UDP                           UDP
DNS                           domain
HTTP                          www-http
TCP Connect                   TCP
Jitter                        Jitter
Sybase SQL                    sybase
Oracle SQL                    orasrv
Microsoft SQL                 ms-sql-s
SAP-R3                        sap-r3
Oracle Forms                  oraforms
Lotus Notes                   lotusnote
Microsoft Exchange            msexch-routing
PeopleSoft                    peoplesoft
Citrix                        ica
Mail (POP3)                   pop3
Mail (SMTP)                   smtp
Other Network Protocol        other-net
Other Application Protocol    other-app
Telnet                        telnet
FTP                           ftp
Other SQL                     other-sql
Network News (NNTP)           nntp



Translation Information 

The translation information section begins with the translation keyword 
followed by an open brace on a new line as follows: 

translation 
{ 

In the translation section, statements identify required information and map 
one or more MIB variables to a database column. The variable statements 
follow this format: 

mtfVariable = expression

Each statement must occupy a single text line. Table 2-3 lists the variables
used in the translation information section and their valid values.

Table 2-3: Translation Information

Variable              Definition

mediaType             Specifies the type of element. This is required. If you are
                      using an existing element type, specify one of the
                      following values:

                      6       Ethernet LAN
                      -1      Token Ring LAN
                      -2      MIB2 LAN
                      -100    WAN
                      -101    Frame Relay
                      -102    MDBS
                      -105    ATM Port
                      -106    ATM Path
                      -107    ATM Channel
                      -200    Router
                      -201    Router with Cache
                      -250    Router CPU
                      -251    Router CPU with Cache
                      -300    Server
                      -301    Server with no Virtual Memory
                      -302    Server with no Memory
                      -303    BMC Windows NT Server
                      -304    BMC UNIX Server
                      -305    Empire Windows NT Server
                      -306    Empire UNIX Server
                      -330    Server CPU
                      -350    User Partition
                      -352    BMC Windows NT Partition
                      -353    BMC UNIX Partition
                      -370    Server Disk
                      -371    BMC Server Disk
                      -502    Server MIB2 LAN
                      -600    Server WAN
                      -700    Modem
                      -701    ISDN interface
                      -725    Remote access server (RAS)
                      -750    RAS CPU
                      -775    Modem pool
                      -800    Network path
                      -801    Network path for voice over IP
                      -802    Network path for application protocols
                      -803    Network path element identifier for FirstSense
                      -805    Network path element identifier for Empire
                              Service Response
                      -1000   Traffic Accountant probe
                      -3000   System Partition
                      -3001   BMC Windows NT System Partition
                      -3002   BMC UNIX System Partition
                      -3100   UNIX Process Set
                      -3101   Windows NT Process Set
                      -3200   UNIX Process Set Excluded
                      -3201   Windows NT Process Set Excluded
                      -3300   UNIX Process
                      -3301   Windows NT Process

                      If you create a new element type, select a value from the
                      following ranges:

                      -50 to -99        LAN
                      -150 to -199      WAN
                      -225 to -249      Router
                      -275 to -299      Router CPU
                      -315 to -329      Server
                      -340 to -349      Server CPU
                      -360 to -369      Server partition
                      -385 to -399      Server disk
                      -713 to -724      Modem/ISDN
                      -738 to -749      RAS
                      -763 to -774      RAS CPU
                      -788 to -799      Modem pool
                      -850 to -899      Network path
                      -3050 to -3099    Server system partition
                      -3150 to -3199    Server process set
                      -3250 to -3299    Server excluded process set
                      -3350 to -3399    Server system partition

                      You must include the minus sign (-) for all element
                      types other than Ethernet.

mediaSpeed            Specifies what the poller uses to obtain the interface
                      speed. This is required. For full-duplex interfaces, this is
                      the incoming interface speed. You can specify the speed
                      as a value in bits per second or as the MIB variable. If the
                      speed is in units other than bits per second, you must
                      convert it in this statement. You should designate this as
                      a gauge by including a percent sign (%).

mediaSpeedOut         Specifies what the poller uses to obtain the outgoing
                      interface speed. This is used only for full-duplex
                      interfaces. You can specify the speed as a value in bits
                      per second or as the MIB variable. If the speed is in units
                      other than bits per second, you must convert it in this
                      statement. You should designate this as a gauge by
                      including a percent sign (%).

operStatus            Specifies the MIB variable (such as ifOperStatus) the
                      poller uses to obtain the interface operational status.
                      This is optional. You should designate it as a gauge by
                      including a percent sign (%).

                      Network Health interprets data for this variable based
                      on the ifOperStatus enumeration from MIB2.

operStatusLastChange  Specifies the MIB variable the poller uses to obtain the
                      last operational status change. This is optional and
                      should be designated as a gauge by including a percent
                      sign (%).

                      Network Health interprets data for this variable based
                      on the ifLastChange enumeration from MIB2.

sysUpTime             Specifies the MIB variable that the poller uses to obtain
                      the system uptime. This is optional and should be
                      designated as a gauge by including a percent sign (%).

availableTime         Specifies the amount of time in seconds that the element
                      is available for the duration of the current polling
                      interval.

                      If this variable is present, the availability data will be
                      based on this value rather than calculations based on
                      ifOperStatus, ifLastChange, and sysUpTime.

reachableTimeSec      Specifies the sum of the response time, the amount of
                      time in seconds that the element was reachable.

latencyMsec           Specifies the round-trip delay, in milliseconds, to reach
                      the element.

totalTime             Specifies the total amount of time during which the
                      element was polled. For data import purposes, the
                      totalTime and deltaTime variables are equivalent.

variableN             Specifies the actual MIB variable or variables to be
                      mapped to the column identified as variable1 through
                      variable30.


Rules Concerning the Creation of MTFs 

When creating an MTF, keep in mind the following: 

• When specifying the mediaSpeed variable for any element that does not 
support the concept of speed (such as a router or server disk), you must 
specify zero (0). Do not leave the value blank. 

• You must include every MIB variable for which you want data collected, 
not just your specific variables. 

• Before you specify a variableN, make sure that it is available for your use. 
Refer to Appendix A for a list of column assignments for variables. 

• The variables that you select to include in the MTF must be indexed in the 
same way. 

• You can specify only the following operators in the MTF: plus (+), 
minus (-), multiply (*), and divide (/). 

• Both the translation information and the MTF must end with a close brace 
(}), preferably on a new line. 

Note 

You must have two close braces (}) at the end of the file.
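
The brace rule can be checked mechanically before you install an MTF. The following Python sketch is an illustration only and is not part of Network Health; the function name braces_ok is hypothetical, and the check assumes that braces never appear inside the quoted agent string.

```python
def braces_ok(mtf_text):
    """Return True if braces balance and the text ends with two close braces."""
    depth = 0
    for ch in mtf_text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                # A close brace appeared before its matching open brace.
                return False
    # Ignore all whitespace so that "}\n}" and "}}" are treated alike.
    compact = "".join(mtf_text.split())
    return depth == 0 and compact.endswith("}}")
```

For example, a file that is missing its final close brace fails the check because the brace depth never returns to zero.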

Using Indexes to Access Variables 

Network Health supports up to four indexes in the poller configuration. 
You can use these indexes to access your variables by appending the index 
number to the MIB variable name. For example, if you wanted to access the 
ifInOctets variable at index 12, you might specify the following:

variable28 = ifInOctets.12

Indexes do not have to be constants. Network Health supports the notation
$1, $2, $3, or $4 to indicate an index as defined in the poller configuration. For
example, if index 1 in the poller configuration is index 12, you might specify
the following:

variable28 = ifInOctets.$1

Network Health assumes index 1 if you do not include an index with the
variable name. The following statement is identical to the one described
previously:

variable28 = ifInOctets

If your MIB variable does not support indexes, you must append a zero to the
variable name. For example:

variable30 = bufferNoMem.0

Note

The index number does not imply order of use.
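
The index rules in this section can be summarized in a short Python sketch. It is an illustration of the rules as stated here, not the poller's actual parser; the function name resolve_index and the tuple of poller indexes are hypothetical.

```python
import re


def resolve_index(name, poller_indexes):
    """Return (variable, index) for an MTF variable reference.

    name           -- for example "ifInOctets", "ifInOctets.12", "ifInOctets.$2"
    poller_indexes -- up to four index values from the poller configuration
    """
    if len(poller_indexes) > 4:
        raise ValueError("at most four indexes are supported")
    base, dot, suffix = name.partition(".")
    if not dot:
        # No index given: index 1 is assumed.
        return base, poller_indexes[0]
    match = re.fullmatch(r"\$([1-4])", suffix)
    if match:
        # $1 through $4 select an index defined in the poller configuration.
        return base, poller_indexes[int(match.group(1)) - 1]
    # A literal index, including the .0 used for variables without indexes.
    return base, int(suffix)
```

With index 1 set to 12 in the poller configuration, ifInOctets, ifInOctets.$1, and ifInOctets.12 all resolve to the same instance.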

Using Counters and Gauges 

Network Health only collects statistical data on two data types: counters and 
gauges. A counter is a non-negative integer which monotonically increases 
until it reaches a maximum value, after which it wraps around and starts 
increasing again from zero. A gauge is a non-negative integer which may 
increase or decrease. By default, Network Health assumes that the data type 
for a variable is a counter unless you indicate that it is a gauge by appending 
a percent sign (%) to the variable. 

Network Health does not handle counters and gauges in the same way. Each 
time that it polls a counter, it subtracts the value of that counter in the 
previous poll interval from its value in the current poll interval to obtain a 
counter difference, which it stores in the database. It subsequently divides the 
database value by the polling interval to obtain a rate. 

In contrast, when Network Health polls a gauge, it stores the gauge directly 
in the database without performing a subtraction. When polling a report 
gauge, it subsequently divides the report gauge by the poll period. When 
polling a calculation gauge, it does not multiply the calculation gauge by the 
poll interval. 

When collecting data for a gauge, you must normalize it to a counter before
Network Health can store it in the database. This is a requirement for the 
database rollups. To normalize a gauge to a counter, you must multiply the 
value returned for the variable by the time between the last poll and the 
current poll, which is represented by the deltaTime MTF variable, divided by 
100. For any variable that specifies a gauge for a data type, you express that 
variable as follows: 

gaugeVariable% * (deltaTime / 100) 

The deltaTime MTF variable is expressed in centiseconds, and the database 
requires units in seconds. 
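
The arithmetic described in this section can be sketched in Python as follows. This is a simplified illustration, not the poller's implementation. The function names are hypothetical, and the modular subtraction used to handle a counter that wrapped once is an assumption; the text above states only that a difference is taken.

```python
def counter_delta(previous, current, bits=32):
    """Difference between two polls of a counter, allowing for one wrap."""
    modulus = 1 << bits
    return (current - previous) % modulus


def counter_rate(previous, current, interval_seconds, bits=32):
    """Counter difference divided by the polling interval, giving a rate."""
    return counter_delta(previous, current, bits) / interval_seconds


def normalize_gauge(gauge_value, delta_time_centiseconds):
    """gaugeVariable% * (deltaTime / 100), with deltaTime in centiseconds."""
    return gauge_value * (delta_time_centiseconds / 100)
```

For a five-minute poll, deltaTime is 30000 centiseconds, so a gauge reading of 50 is stored as 50 * 300 = 15000.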

Using Function Call Syntax in an MTF Expression 

The MTF expression language supports function call syntax. Functions have 
the following format: 

functionName (arg1, ..., argn)

MTFs support the following defined functions: 

• round 

• switch 

• constArrayMap 

• counter64 

• nwbCounter64 

• snmpCounter64 

• useWrappedValue

• isAggregated 

• min 

• max 

• nullData

The round Function 

The round function has the following syntax: 

round (x) 

The value of x is a variable. This function rounds the value of x to the nearest
integer. For example:

• round (22.2) = 22

• round (22.87) = 23 

• round (42.5) = 43 
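
The last example shows that a value exactly halfway between two integers rounds up. If you check MTF expressions with a script, note that Python's built-in round rounds halves to the nearest even integer, so round(42.5) returns 42. The sketch below reproduces the three examples; the name mtf_round is hypothetical, and the treatment of negative halves is an assumption because this guide does not describe it.

```python
import math


def mtf_round(x):
    """Round to the nearest integer, with halves rounded up."""
    return math.floor(x + 0.5)
```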

The switch Function 

The switch function has the following syntax: 
switch (x, d1, r1, ..., dn, rn)

The values of x, d1, and r1 may be any expressions. This function is a general
equality-based conditional. It evaluates x and compares the result for equality
with the evaluated values of d1, d2, and so on, in succession, until it finds a
match. When it finds a match, the switch function returns the evaluated value
ri (r1, r2, and so on). If it does not find a match, it returns 0.
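
The lookup order can be sketched in Python. This is an illustration of the behavior described above, not the MTF expression evaluator; the name mtf_switch is hypothetical, and Python evaluates all arguments before the call rather than one at a time.

```python
def mtf_switch(x, *pairs):
    """switch(x, d1, r1, ..., dn, rn): return ri for the first di equal to x."""
    if len(pairs) % 2:
        raise ValueError("expected (d, r) pairs after x")
    for d, r in zip(pairs[0::2], pairs[1::2]):
        if x == d:
            return r
    # No d value matched.
    return 0
```

For example, mtf_switch(2, 1, 10, 2, 20) matches the second pair and returns 20, while a value with no matching d returns 0.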

The constArrayMap Function 

The constArrayMap function has the following syntax: 
constArrayMap (x, c0, c1, ..., cn-1)

This function maps one set of values to another set of variable values. It
truncates the value of x to an integer, if necessary, and uses the integer value
as an index to the set of constants shown as c0, c1, up to cn-1. The c values
must be constants. The function checks these values when the MTF is parsed
and returns c[x].

Note

You must have a constant for each possible value of x; otherwise, Network
Health generates a runtime error. If x is not in the domain from 0 to n-1,
the result is 0.

For example:

variable = constArrayMap (x, 12, 4, 7, 22, 40)

When x is 0.25, the function truncates the value to 0. The 0 index value in the 
constant array is 12; thus, the variable value evaluates to 12. When x is 1, the 
value is 1. The 1 index value in the array is 4; thus, the variable value evaluates 
to 4. 
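
The same mapping can be sketched in Python. This is an illustration only; the name const_array_map is hypothetical, and the sketch returns 0 for an out-of-range index, following the last sentence of the note above.

```python
def const_array_map(x, *constants):
    """Truncate x to an integer and use it as an index into the constants."""
    index = int(x)  # int() truncates toward zero
    if 0 <= index < len(constants):
        return constants[index]
    # x is outside the domain 0 to n-1.
    return 0
```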

The counter64 Function

The counter64 function has the following syntax: 
counter64 (hi, lo) 

The value of hi is the high 32 bits of a 64-bit counter value and the value of lo 
is the low 32 bits of a 64-bit counter value. 

Network Health obtains the high and low portions of the 64-bit counter 
from different MIB variables and concatenates them to form a single 64-bit 
value. It then uses this value to calculate deltas (without loss of precision). 
As with 32-bit counters, Network Health checks the wrap. If it detects a delta
greater than half the word resolution (in this case, a delta greater than 2^63 - 1 =
9223372036854775807), it generates a wrap error.
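
The concatenation and the wrap check can be sketched in Python. This is a simplified illustration, not the poller's code. The function names are hypothetical, the hi and lo inputs are assumed to be unsigned 32-bit values, and raising OverflowError stands in for the wrap error.

```python
MAX_DELTA = 2**63 - 1  # 9223372036854775807, half the 64-bit word resolution


def counter64(hi, lo):
    """Concatenate the high and low 32-bit halves into one 64-bit value."""
    return (hi << 32) | lo


def counter64_delta(previous, current):
    """Delta between two polls; a delta above MAX_DELTA is a wrap error."""
    delta = (current - previous) % (1 << 64)
    if delta > MAX_DELTA:
        raise OverflowError("wrap error: delta exceeds 2^63 - 1")
    return delta
```

Because the halves are combined before the subtraction, a carry from the low half into the high half between polls does not distort the delta.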

The nwbCounter64 Function 

The nwbCounter64 function has the following syntax: 
nwbCounter64 (x) 

The value of x must be a variable. This function interprets the value of x 
as a Newbridge octet-string based 64-bit counter. In contrast to the 
snmpCounter64 function, you must use this function to denote this type 
of counter. Because the use of an octet string as a counter is nonstandard, the
poller checks the SNMP type returned in the response packet. If the type is
not octet-string and the value is to be bound to a Newbridge 64-bit counter
variable, the poller generates an error.

The setSnmpVersion Function 

The setSnmpVersion function sets the SNMP version of the packets sent by 
the poller. By default, all packets are SNMP version 1. If your device requires 
a different SNMP version, you can use the setSnmpVersion function to 
specify it in the MTF file for the device. 

The setSnmpVersion function has the following syntax: 
setSnmpVersion (version)

The value of version can be 1, 2, 2c, or 3.

The snmpCounter64 Function

The snmpCounter64 function has the following syntax: 
snmpCounter64 (x) 

The value of x must be a variable. This function interprets the value of x as an 
SNMPv2 64-bit counter. It is an optional function because it is dependent on 
the agent. If 64-bit SNMPv2 counters appear in a MIB, and they are properly 
tagged as such in the response SNMP packet, the poller reports on them 
correctly. 

The useWrappedValue Function 

The useWrappedValue function has the following syntax: 
useWrappedValue (x) 

The value of x must be a variable. This function compares the current value of 
x to its value at the previous poll. If x is lower than that value, the function 
evaluates it to the absolute value of x. If x is higher than that value, the
function evaluates it to the difference of the current value minus the value at 
the previous poll. 
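
The two cases can be sketched in Python. This is an illustration of the description above only; the name use_wrapped_value is hypothetical, and the result for equal values (a difference of zero) is an assumption because the text does not cover that case.

```python
def use_wrapped_value(previous, current):
    """Evaluate x as described for useWrappedValue."""
    if current < previous:
        # The value dropped since the previous poll: use its absolute value.
        return abs(current)
    # Otherwise use the difference between the two polls.
    return current - previous
```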

The isAggregated Function 

The isAggregated function has the following syntax: 
VarN = isAggregated ()

The isAggregated function specifies that this MTF variable is an aggregation 
of data from another element (such as a child element). It is a marker that 
denotes which columns are defined and aggregated internally by the poller. 
Its presence or absence has no effect on the data being placed in the database. 
Instead, isAggregated marks columns in the MTF that would not otherwise 
be specified so reports that reference the column will know the column is 
valid. Within the MTF, you can remove the isAggregated statement from a 
variable, thus making that variable untrendable for the parent. You should 
never add isAggregated to a variable that is not already marked that way in 
another MTF of the same media type. The only option is to remove 
isAggregated, which should be done when a device does not support the 
aggregated column. 

If you are defining a new parent element based on an existing element type,
you can mimic the current aggregation schemes, specify another variable 
instead, or remove the aggregation scheme. You cannot add isAggregated to 
a variable that was never aggregated in the parent type that you are 
modeling. 

The min Function

The min function has the following syntax:
min (x, y)

The values of x and y may be any expressions. This function returns the
minimum value of x and y.

The max Function

The max function has the following syntax:

max (x, y)

The values of x and y may be any expressions. This function returns the
maximum value of x and y.

The nullData Function 

The nullData function has the following syntax: 
variable = nullData () 

This function interprets any variable as null that is missing from the MTF, but 
that the poller collects by default. For example, the poller cannot measure an 
import element if it does not have a definition in the import data file. The 
nullData function interprets the import element as null to ensure that the 
poller does not report on it. 



Compiling MIBs 

Network Health uses a precompiled MIB (PCM) file to determine the MIB 
variables for which you want to collect data during the poll. A PCM file 
contains the name and object identifier (OID) for the variables that are 
defined in all MTFs that reference the MIB. In addition, each PCM file 
contains additional variables required by Network Health to discover and 
poll the element. 

For example, the PCM file for the mib2.mib is as follows:

hrDeviceErrors               1.3.6.1.2.1.25.3.2.1.6
hrDiskStorageCapacity        1.3.6.1.2.1.25.3.6.1.4
hrMemorySize                 1.3.6.1.2.1.25.2.2
hrProcessorLoad              1.3.6.1.2.1.25.3.3.1.2
hrStorageAllocationFailures  1.3.6.1.2.1.25.2.3.1.7
hrStorageAllocationUnits     1.3.6.1.2.1.25.2.3.1.4
hrStorageSize                1.3.6.1.2.1.25.2.3.1.5
hrStorageUsed                1.3.6.1.2.1.25.2.3.1.6
hrSystemNumUsers             1.3.6.1.2.1.25.1.5
ifInDiscards                 1.3.6.1.2.1.2.2.1.13
ifInErrors                   1.3.6.1.2.1.2.2.1.14
ifInNUcastPkts               1.3.6.1.2.1.2.2.1.12
ifInOctets                   1.3.6.1.2.1.2.2.1.10
ifInUcastPkts                1.3.6.1.2.1.2.2.1.11
ifInUnknownProtos            1.3.6.1.2.1.2.2.1.15
ifLastChange                 1.3.6.1.2.1.2.2.1.9
ifOperStatus                 1.3.6.1.2.1.2.2.1.8
ifOutDiscards                1.3.6.1.2.1.2.2.1.19
ifOutErrors                  1.3.6.1.2.1.2.2.1.20
ifOutNUcastPkts              1.3.6.1.2.1.2.2.1.18
ifOutOctets                  1.3.6.1.2.1.2.2.1.16
ifOutUcastPkts               1.3.6.1.2.1.2.2.1.17
ifSpeed                      1.3.6.1.2.1.2.2.1.5
ipForwDatagrams              1.3.6.1.2.1.4.6

To compile your MIB, you can edit the existing PCM file for your MIB or use 
the nhCompileMib command. 

Editing the PCM File 

If you are creating an MTF for which a PCM file was supplied, you must edit 
the PCM file only. However, before you reinstall or upgrade Network Health, 
you should make a copy of this PCM file. 

To edit an existing PCM file, you add the name and OID of your variable. The 
OID must include the indicator for the table in which the variable resides and 
the indicator for the instance of that variable. For example, the variable 
ifInOctets is the tenth instance in table 1:

ifInOctets 1.3.6.1.2.1.2.2.1.10

Note

The OID for your variable must use the SNMPv1 format.
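
Before you edit a PCM file by hand, it can help to confirm that each entry is a single name followed by a single dotted OID. The Python sketch below is an illustration only. It assumes the whitespace-separated layout shown in the listing above, which may not cover every PCM file, and the function names parse_pcm and instance_oid are hypothetical.

```python
def parse_pcm(text):
    """Map variable names to OID strings from 'name OID' lines."""
    entries = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            # Skip blank lines and anything that is not a name/OID pair.
            continue
        name, oid = parts
        entries[name] = oid
    return entries


def instance_oid(entries, name, index):
    """Append an instance index to the OID recorded for a variable."""
    return f"{entries[name]}.{index}"
```

For example, with the ifInOctets entry above, instance 12 corresponds to the OID 1.3.6.1.2.1.2.2.1.10.12.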

Using the nhCompileMib Command 

To use the nhCompileMib command: 

1. Make sure that both your MTF and MIB files are in /nethealth/poller.

2. Optionally, use one of these commands to source the Network Health
resource file that is appropriate for your shell environment:

Table 2-4: Commands Used to Source Network Health Resource Files

Shell       Command

Bourne      . nethealthrc.sh
C           source nethealthrc.csh
Korn        . nethealthrc.ksh

Note

If you do not source the resource file, change to the /nethealth/bin
directory or specify that directory in your Network Health
commands.

3. Enter the following command: 

nethealth/bin/nhCompileMib -a mibFile.mib >& mibFile.mib.pcm

Note

You must include the ampersand (&) with the redirect (>).

The resulting PCM file contains all OIDs from both your MIB file and
the /nethealth/poller/nhCommonMib file.

If you have difficulty compiling the MIB, you must edit it to remove or
change offending lines. Some vendors' MIBs do not easily compile unless you
edit the file. You should delete any tables or individual variables from the
MIB that your MTF does not need.

Specifying Null Data

If the agent type for an MTF does not support a variable, the poller will still 
report on it by default, assigning it a value of zero. However, if you assign 
"null data" to an MTF variable, Network Health disregards it when polling. 
As a result, in the Trend report, Network Health does not present the variable 
as valid. Rather than assigning a value of 0 to the data, it considers the data to 
be nonexistent. 

If you do not define variables 1 through 30 for a given MTF, Network Health 
considers them to be null data. If you omit them from the file, it considers 
them to be invalid for elements assigned to that MTF. However, it interprets 
the latencyMsec and availableTime variables differently. Network Health 
assumes that all polled elements support the values for these functions. 
Therefore, by default, it considers their data to be valid. 

If a device does not support the latencyMsec or availableTime variable, you 
must assign null data to the appropriate variable in the MTF using the 
following syntax: 

latencyMsec = nullData()

availableTime = nullData() + deltaTime/100

Note

Since Health reports do not support the null data feature, the MTF assigns
a default value of zero to the nullData() function. Therefore, you must
append the additional calculation to the availableTime variable to prevent 
Health reports from generating exceptions for an availability of zero. 

Adding Agents to the List of Agent Types 

After constructing the MTF, you must add the agent that you defined to the 
list of agent types in /nethealth/poller/agent.types. Network Health uses this
file to provide the agents for the Poller Configuration, Modify Element, and 
Add Element dialog boxes. 

Note

If you reinstall or upgrade Network Health, follow this procedure to add 
your agents to the agent types file. 

To add an agent:

1. Change to the /nethealth/poller directory.

2. Rename the agent.types file as agent.types.bak to retain the previous
version.

3. If you are adding a response path MTF, delete or rename the
dataSourceInfo.ddi file.

4. Quit the Network Health console. Select Console -> Quit.

5. Restart the console using the nethealth command. Network Health
automatically recreates the agent.types and the dataSourceInfo.ddi files.

Polling Your Elements 

Before you can poll the elements for which you created an MTF, you must 
perform the following tasks: 

• Restart the Network Health server. 

• Assign the newly created agents to your elements. 

• Add your elements to the poller configuration. 

Restarting Network Health 

After creating an MTF, you must restart your Network Health server by 
following these steps: 

1. Stop the Network Health server. Select Console -> Stop Server. 

2. Restart the Network Health server. Select Console -> Start Server. 

Assigning Agents to Elements 

For Network Health to collect and store data relating to the variables that you 
defined in your MTF, you must assign your new agent type to each element. 
Use the Poller Configuration dialog box to modify the agent type of existing 
elements. Refer to the Network Health User Guide for instructions on modifying 
elements. If you have created your own element type, Network Health cannot 
locate your elements during the discover process. However, it might locate 
those elements using a different agent type. 






Adding Elements to the Poller Configuration 

If your elements do not appear in the Poller Configuration dialog box, you 
must add your elements. Refer to the Network Health User Guide for 
instructions on adding elements. To create variable labels for use in Network 
Health Trend reports, refer to Chapter 3. 

Adding Variable Labels 



To make your variables available to Network Health reports, you need to add 
labels for these variables to the Network Health database. You can modify 
four files to add labels and then update the database using the nhConvertDb 
command. This chapter describes how to modify these four files. 



To add labels to the database, Network Health provides the following four 
files, located in /nethealth/db/data: 

• elementType.usr 

• columnExpression.usr 

• variable.usr 

• elementTypeVariable.usr 

By default, these files do not contain any data. Network Health 
provides a .sys version of each file that contains the default labels used by 
Network Health. 



Note 

Do not modify the .sys files. To add your labels, you must modify the .usr 
files. 

Modifying User Files 

When modifying the .usr files to add variable labels, keep in mind the 
following rules: 

• Always place a vertical bar ( | ) between fields. 

• Do not place a vertical bar within the data or at the end of the row. 

• Always enter a value in each field; fields cannot be blank. 

• Do not use tabs; tabs are not supported. 

• Always end the last entry with a single carriage return. Do not add any 
blank lines to the file. 

• To add your variable labels to the database, you may not need to modify 
all of the files. 
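
The formatting rules above lend themselves to a mechanical check before you update the database. The following Python sketch is one possible way to do that; the function name check_usr_text is hypothetical, and the sketch assumes that each entry ends with a newline character ("\n").

```python
def check_usr_text(text):
    """Return a list of (line_number, problem) pairs for a .usr file's text.

    Line number 0 is used for problems that concern the file as a whole.
    """
    problems = []
    if text and not text.endswith("\n"):
        problems.append((0, "last entry does not end with a carriage return"))

    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines = lines[:-1]  # discard the empty piece after the final newline

    for number, line in enumerate(lines, start=1):
        if line.strip() == "":
            problems.append((number, "blank line"))
            continue
        if "\t" in line:
            problems.append((number, "tab character"))
        if line.endswith("|"):
            problems.append((number, "vertical bar at end of row"))
        # Ignore trailing bars here so they are reported only once.
        fields = line.rstrip("|").split("|")
        if any(field.strip() == "" for field in fields):
            problems.append((number, "blank field"))
    return problems
```

An empty result means that the text passed all of the checks. The sketch cannot tell whether a vertical bar was meant as data, so a bar inside a label shows up only as an unexpected extra field.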

The remainder of this section describes the .usr files and explains when and 
how to modify them. 

Modifying the elementType.usr File 

You only need to add an entry in the elementType.usr file if you are creating a 
new element type (that is, if you specified a user value for the mediaType 
statement in your MTF). 

For each new element, you must add an entry in the elementType.usr file. 
This file associates your element type with a standard Network Health 
element type. 

Note — 

You do not need to add entries to the elementType.usr file if you are not 
creating a new element. 

Table 3-1 lists the elementType.usr fields and their required values. 



Table 3-1: The elementType.usr Fields 

Field                   Description

ELEMENT_TYPE            The value that you assigned the mediaType
                        variable in your MTF.

RPT_ALIAS_PITEM_TYPE    The Network Health element type to associate
                        with your element type. Specify one of the
                        following values:

                        0     Ethernet LAN
                        1     Token Ring LAN
                        2     MIB2 LAN
                        100   WAN
                        101   Frame Relay
                        102   MDBS
                        105   ATM Port
                        106   ATM Path
                        107   ATM Channel
                        200   Router
                        201   Router with Cache
                        250   Router CPU
                        251   Router CPU with Cache
                        300   Server
                        301   Server with no Virtual Memory
                        302   Server with no Memory
                        303   BMC Windows NT Server
                        304   BMC UNIX Server
                        305   Empire Windows NT Server
                        306   Empire UNIX Server
                        330   Server CPU
                        350   User Partition
                        352   BMC Windows NT Partition
                        353   BMC UNIX Partition
                        370   Server Disk
                        371   BMC Server Disk
                        502   Server LAN
                        600   Server WAN
                        700   Modem
                        701   ISDN Interface
                        725   Remote Access Server
                        750   RAS CPU
                        775   Modem Pool
                        800   Network Path
                        803   FirstSense
                        805   Empire Service Response
                        825   Application Client
                        900   Application Server
                        1000  Traffic Accountant Probe
                        3000  System Partition
                        3001  BMC NT System Partition
                        3002  BMC UNIX System Partition
                        3100  UNIX Process Set
                        3101  NT Process Set
                        3200  UNIX Process Set Excluded
                        3201  NT Process Set Excluded
                        3300  UNIX Process
                        3301  NT Process

ELEMENT_CLASS           Specify 1.

LABEL                   A label for the element. You can specify up
                        to 32 characters.

WEB_LABEL               A label for the element in the Web interface
                        for the list of elements in the Run Trend
                        Report page. You can specify up to 64
                        characters.



The elementType.usr file should contain fields with a format that is similar to 
those in the elementType.sys file, as shown in this example: 

0|0|1|Ethernet|Ethernet
1|1|1|Token Ring|Token Ring
2|100|1|MIB2LAN|MIB2 Lan Port
100|100|1|WAN|WAN
101|101|1|Frame Relay|Frame Relay
105|100|1|ATM Port|ATM Port
106|101|1|ATM Path|ATM Path
200|200|1|Router|Router/Switch
201|201|1|Router|Router with 1 CPU

Network Health uses the elementType files to determine the element type to 
use for your element when generating reports and displaying in list form, 
such as lists for groups. In the above file, MIB2 LAN has an ELEMENT_TYPE 
of 2 and RPT_ALIAS_PITEM_TYPE of 100, which is a WAN element type. 
Network Health uses a WAN element type for all MIB2 LAN elements. 

To add an entry for your element: 

1. In the first field, specify the value (without a minus (-) sign) that you 
assigned to mediaType in your MTF. 

2. In the second field, associate your element type with a standard Network 
Health element type. 

3. In the third field, specify 1. 

4. In the fourth and fifth fields, create appropriate labels for your element. If 
you are creating your own element type, provide a unique Web label so 
that Network Health can display your element in lists (such as Trend 
reports) on the Web. 
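
As an illustration of these four steps, the following Python sketch assembles one elementType.usr entry. The function name element_type_row and the sample mediaType value of -5001 are hypothetical; the sketch checks only the field rules stated in Table 3-1 and does not verify that the alias is one of the listed element types.

```python
def element_type_row(media_type, alias_type, label, web_label):
    """Build one elementType.usr entry from the values used in the MTF."""
    # Step 1: the mediaType value from the MTF, without its minus sign.
    element_type = abs(int(media_type))
    # Step 4: enforce the label lengths given in Table 3-1.
    if len(label) > 32:
        raise ValueError("LABEL is limited to 32 characters")
    if len(web_label) > 64:
        raise ValueError("WEB_LABEL is limited to 64 characters")
    if not label or not web_label:
        raise ValueError("fields cannot be blank")
    if "|" in label or "|" in web_label:
        raise ValueError("a vertical bar cannot appear within the data")
    # Step 2 supplies alias_type; step 3 fixes ELEMENT_CLASS at 1.
    fields = [str(element_type), str(int(alias_type)), "1", label, web_label]
    return "|".join(fields)


# A hypothetical user element type reported as a WAN element (alias 100).
print(element_type_row(-5001, 100, "My WAN Probe", "My WAN Probe"))
# prints: 5001|100|1|My WAN Probe|My WAN Probe
```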

Modifying the variable.usr File 

To create a unique label for variables in your MTF, you must add an entry in 
the variable.usr file. This label appears in the list of variables for running a 
Trend report on your element type. You can use existing Network Health 
labels that are defined in the variable.sys file for your variables. 

Table 3-2 lists the variable.usr fields and their required values. 

Note 

You do not need to add entries to the variable.usr file if you are using 
existing variable labels. 



Table 3-2: The variable.usr Fields 

Field          Description

VAR_ID         A unique number to identify the variable.
               Specify the following ranges:

               Original Equipment Manufacturer (OEM):
               900,000 up to 1,000,000

               End-user: 1,000,000 and above

UNITS_ID       A number indicating the type of units used to
               measure this variable. Specify the following values:

               0   Rate as a counter
               1   Bytes as a counter
               2   Frames as a counter
               3   Errors as a counter
               4   Percent as a gauge
               5   Per second as a gauge
               6   Buffers as a gauge
               7   Bytes as a gauge
               8   Cells as a counter
               9   Pages as a gauge
               10  Total time as a gauge
               11  Milliseconds as a gauge
               12  Per call minute as a gauge
               13  Gauge as a gauge
               14  Bits per call second as a counter
               15  Bits as a counter
               16  Minimum milliseconds as a gauge
               17  Maximum milliseconds as a gauge
               18  Transactions as a counter
               19  Size as an aggregate value that can be either
                   an average (for gauge percentage values) or a
                   total (for counter values)

LABEL          A label used to identify the variable in a list. You can
               specify up to 32 characters. Spaces are permitted.

SHORT_LABEL    A shorter label for the variable. You can specify up
               to 16 characters. Spaces are permitted.

SYMBOL         An internal identifier for the variable. It must be
               unique for the element type. You can specify up to
               32 characters. Spaces are not permitted.

The variable.usr file should contain fields that are similar in format to the 
variable.sys file, as shown in the following example: 

1|2|Frames|Frames|frames
2|1|Bytes|Bytes|bytes
3|2|Broadcasts|Broadcasts|broadcasts
4|2|Multicasts|Multicasts|multicasts
5|2|Alignment Errors|Alignment Errors|alignmentErrors
6|2|Collisions|Collisions|collisions
7|2|Errors|Errors|errors
8|2|TR Abort Errors|TR Abort Errors|abortErrors
9|2|TR Burst Errors|TR Burst Errors|burstErrors

Add an entry in the variable.usr file only for those variables for which you 
want to create your own labels. 

To add an entry: 

1. In the first field, assign each variable a VAR_ID using the ranges listed in 
Table 3-2. 

2. In the second field, indicate the type of units for measuring the data. 

3. In the third field, create a label that is unique for your element type. You 
can specify up to 32 characters. This label appears in the list of available 
variables in the Run Trend Report dialog box when your element is 
selected. 

4. In the fourth field, specify the same label, but truncate it to 16 characters. 

5. In the fifth field, specify the same label, but omit spaces and begin with a 
lowercase letter. The label you enter in this field must also be unique for 
your element type. 
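
The fourth and fifth fields are derived from the label in the third field, so the derivation can be sketched in a few lines of Python. The function name variable_row is hypothetical, and the sketch assumes an end-user VAR_ID (1,000,000 and above); it does not check that the label or symbol is unique for your element type.

```python
def variable_row(var_id, units_id, label):
    """Build one variable.usr entry, deriving SHORT_LABEL and SYMBOL."""
    if var_id < 1_000_000:
        raise ValueError("end-user VAR_ID values start at 1,000,000")
    if not 0 <= units_id <= 19:
        raise ValueError("UNITS_ID must be one of the values in Table 3-2")
    if not label or len(label) > 32 or "|" in label:
        raise ValueError("LABEL must be 1 to 32 characters with no vertical bar")

    # Step 4: the same label, truncated to 16 characters.
    short_label = label[:16].rstrip()
    # Step 5: the same label without spaces, starting with a lowercase letter.
    joined = label.replace(" ", "")
    symbol = joined[:1].lower() + joined[1:]
    return "|".join([str(var_id), str(units_id), label, short_label, symbol])


print(variable_row(1000001, 10, "Error Free Seconds"))
# prints: 1000001|10|Error Free Seconds|Error Free Secon|errorFreeSeconds
```

A truncated short label such as "Error Free Secon" is rarely what you want to see in a report, so in practice you would likely choose the short label by hand and use a helper like this only to check lengths.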

Modifying the columnExpression.usr File 

The columnExpression files identify a column or a formula for the data that 
you are storing in the database. Add entries in the columnExpression.usr file 
only if you want to use a formula such as a variable derivative for your data. 
You do not need to add entries to this file if you are only storing data in one of 
the columns that is available for your use. For example, if you want to store 
the total number of bytes, you could create a column expression that uses the 
following formula: 

BYTES_IN+BYTES_OUT 

Note 

The above formula is an existing column expression with a column ID 
(COL_ID) of 85 in the columnExpression.sys file. 

Table 3-3 lists the columnExpression.usr fields and their required values. 



Table 3-3: The columnExpression.usr Fields 

Field             Description

COL_ID            A number to identify the column. If you are creating
                  your own column expressions, specify these ranges:

                  OEM: 900,000 up to 1,000,000
                  End-user: 1,000,000 and above

COL_EXPRESSION    A string of up to 255 characters that describes the
                  column or the formula for the data.



The columnExpression.usr file should contain fields that are similar in format 
to the columnExpression.sys file, as shown in the following example. 

30|bytes_out
31|dll_transits+dll_enet_frames
32|dll_errors-dll_collisions
33|(tr_lost_frame-dll_frames)-tr_burst-tr_congestion-tr_contention_streami
34|(FLOAT4(TR_CONTENTION_STREAMING)/FLOAT4(TR_BIT_STREAMING))*DELTA_TIME*1
43|UTIL

Note 

You can include combinations of variables in the COL_EXPRESSION 
string, similar to Columns 31 through 34 in this example. 

The COL_ID associates a variable with a column expression in the database. If 
you are just using one of the 30 columns and do not need a formula, you do 
not need to create an entry in the columnExpression.usr file. Refer to 
Table A-1 on page A-2 for a list of the column IDs for variable1 through 
variable30 and the associated column expression. 

If you want to add an entry to the columnExpression.usr file, select an 
identifier using the range listed in Table 3-3. You can use the 
columnExpression.sys file as a template for creating your column expression. 
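
A columnExpression.usr entry has only two fields, so the main things to check are the identifier range and the 255-character limit. The Python sketch below is illustrative: the function name column_expression_row is hypothetical, it assumes an end-user COL_ID, and the sample formula PACKETS_IN+PACKETS_OUT is built from column names in Table A-1 purely as an example (it may already exist in the columnExpression.sys file). The sketch does not validate the formula itself.

```python
MAX_EXPRESSION_LENGTH = 255
FIRST_END_USER_ID = 1_000_000


def column_expression_row(col_id, expression):
    """Build one columnExpression.usr entry for an end-user formula."""
    if col_id < FIRST_END_USER_ID:
        raise ValueError("end-user COL_ID values start at 1,000,000")
    if not expression.strip():
        raise ValueError("COL_EXPRESSION cannot be blank")
    if len(expression) > MAX_EXPRESSION_LENGTH:
        raise ValueError("COL_EXPRESSION is limited to 255 characters")
    if "|" in expression or "\t" in expression:
        raise ValueError("vertical bars and tabs are not allowed in the data")
    return f"{col_id}|{expression}"


print(column_expression_row(1000001, "PACKETS_IN+PACKETS_OUT"))
# prints: 1000001|PACKETS_IN+PACKETS_OUT
```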

Modifying the elementTypeVariable.usr File 

The elementTypeVariable.usr file associates your variable with an element 
type, a variable ID, and a column. You must create an entry in the 
elementTypeVariable.usr file for each variable that you created in your MTF. 
Table 3-4 lists the elementTypeVariable.usr fields and their required values. 



Table 3-4: The elementTypeVariable.usr Fields 

Field           Description

ELEMENT_TYPE    The element identifier from either the
                elementType.sys or the elementType.usr file.

VAR_ID          The variable identifier from either the variable.sys
                or the variable.usr file for the variable.

DATA_SRC        Specify 1.

COL_ID          The column identifier from either the
                columnExpression.sys or the columnExpression.usr
                file for the variable.



The elementTypeVariable.usr file should contain fields that are similar in 
format to the elementTypeVariable.sys file, as shown in the following 
example: 

0|1|1|1
0|2|1|2
0|3|1|4
0|4|1|3
0|5|1|11
0|6|1|9
0|7|1|10
0|118|1|57
0|119|1|58
0|120|1|59
0|121|1|60



To add entries: 

1. For the first field, select an existing element type from the elementType.sys 
file. If you are creating elements, you must specify the ELEMENT_TYPE 
value that you added to the elementType.usr file. 

2. For the second field, specify the variable ID from the variable.sys file. If 
you created your own variable label, specify the VAR_ID that you created 
in the variable.usr file. 

3. For the third field, specify 1. 

4. For the fourth field, specify a column ID from the columnExpression.sys 
file if you are using an existing column expression. For example, if you 
assigned your MIB variable to variable26, specify 26 for the COL_ID. If 
you create a column expression for a formula, specify the COL_ID that you 
added to the columnExpression.usr file. 
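
Because every elementTypeVariable.usr entry refers to identifiers defined in the other files, a cross-check can catch mistyped IDs before you update the database. The following Python sketch shows one way to do this; the function names and the sample identifiers (5001, 1000001, 1000002) are hypothetical, and the sketch assumes that you have already collected the known identifiers from the .sys and .usr files as sets of strings.

```python
def parse_rows(text):
    """Split .usr text into lists of fields, skipping empty lines."""
    return [line.split("|") for line in text.splitlines() if line.strip()]


def find_bad_rows(etv_text, element_types, var_ids, col_ids):
    """Return the elementTypeVariable entries that do not have four fields,
    do not use 1 for DATA_SRC, or refer to an unknown identifier."""
    bad = []
    for fields in parse_rows(etv_text):
        ok = (
            len(fields) == 4
            and fields[0] in element_types   # ELEMENT_TYPE
            and fields[1] in var_ids         # VAR_ID
            and fields[2] == "1"             # DATA_SRC
            and fields[3] in col_ids         # COL_ID
        )
        if not ok:
            bad.append("|".join(fields))
    return bad


element_types = {"0", "5001"}
var_ids = {"1", "1000001"}
col_ids = {str(n) for n in range(1, 31)}   # variable1 through variable30
text = "5001|1000001|1|26\n5001|1000002|1|26\n"
print(find_bad_rows(text, element_types, var_ids, col_ids))
# prints: ['5001|1000002|1|26']
```

The second entry is reported because VAR_ID 1000002 was never defined in the sample set of variable identifiers.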

Updating the Database 

To update the Network Health database with your changes to the label tables, 
you use the nhConvertDb command. The command has the following syntax: 

nhConvertDb database 

The database variable specifies the name of the database to convert. This is 
required. 

To run the nhConvertDb command: 

1. Stop the Network Health server. Select Console -> Stop Server. 

2. If you used the default database name, nethealth, enter the following 
command: 

/nethealth/bin/nhConvertDb nethealth 

3. When the database conversion finishes, restart the Network Health server. 
Select Console -> Start Server. 

Each type of element has a set of database columns that are fixed and 
reserved for Network Health, a set of database columns reserved for use by 
original equipment manufacturers (OEM) or third parties, and a set of 
database columns reserved for users. OEMs and users can add their own 
variables (such as error-free seconds) to the second and third sets of columns. 

For the following element types, the tables in this appendix list the database 
columns (by MTF variable name) and the purpose of each. Variables not 
reserved by Network Health are available to OEMs or users. 

• Ethernet 

• Token Ring 

• WAN 

• MIB2 LAN and MIB2 LAN Full Duplex 

• Frame Relay 

• ATM Ports 

• ATM Paths 

• ATM Channels 

• Routers 

• Router CPUs 

• Servers 

• Server CPUs 

• Server Partitions 

• Server Disks 

• Server Interfaces 

• Server Process Sets 

• Remote access server (RAS) devices 

• Modem pools 

• Modems 

• ISDN interfaces 

• Network paths 

• Network paths for DNS 

• Network paths for HTTP 

• Network paths for voice over IP 

• Network Paths for FirstSense 

• Network Paths for Empire Service Response 

In addition, Table A-1 lists the column IDs and the associated column 
expressions for variable1 through variable30 as defined in the 
columnExpression.sys file. 

Table A-1: Column IDs and Column Expression 

COL_ID    COLUMN_EXPRESSION
1         DLL_FRAMES
2         DLL_BYTES
3         DLL_MCASTS
4         DLL_BCASTS
5         DLL_RCV_OFF_FRAMES
6         DLL_XMT_OFF_FRAMES
7         DLL_TRANSITS
8         DLL_ENET_FRAMES
9         DLL_COLLISIONS
10        DLL_ERRORS
11        DLL_ALGN_ERRORS
12        TR_SET_RECOVERY_MODE
13        TR_SIGNAL_LOSS
14        TR_BIT_STREAMING
15        TR_CONTENTION_STREAMING
16        TR_LINE
17        TR_BURST
18        TR_INTERNAL
19        TR_ABORT
20        TR_ADDRESS_COPIED
21        TR_CONGESTION
22        TR_LOST_FRAME
23        TR_TOKEN
24        TR_FREQUENCY
25        TR_FRAME_COPIED
26        TR_LLC_FRAMES
27        PACKETS_IN
28        BYTES_IN
29        PACKETS_OUT
30        BYTES_OUT



Table A-2: Column Allocations for Ethernet Elements 

MTF Variable    Description
variable1       Number of frames
variable2       Number of bytes
variable3       Number of multicasts
variable4       Number of broadcasts
variable5       Available for OEM use
variable6       Available for OEM use
variable7       Available for OEM use
variable8       Available for OEM use
variable9       Number of collisions
variable10      Number of errors
variable11      Number of alignment errors
variable12      Number of non-unicast frames (in)
variable13      Number of deferred frames (out)
variable14      Number of discards total (in+out)
variable15      Reserved
variable16      Number of unknown protocol packets
variable17      Reserved
variable18      Reserved
variable19      Reserved
variable20      Reserved
variable21      Reserved
variable22      Number of frames (in)
variable23      Number of bytes (in)
variable24      Number of errors (in)
variable25      Number of discards (in)
variable26      Available for use
variable27      Available for use
variable28      Available for use
variable29      Available for use
variable30      Available for use


Table A-3: Column Allocations for Token Ring Elements 

MTF Variable    Description
variable1       Number of frames
variable2       Number of bytes
variable3       Number of multicasts
variable4       Number of broadcasts
variable5       Available for OEM use
variable6       Available for OEM use
variable7       Available for OEM use
variable8       Available for OEM use
variable9       Available for OEM use
variable10      Number of errors
variable11      Available for use
variable12      Number of Token Ring Beaconing Event 1
variable13      Number of Token Ring Beaconing Event 2
variable14      Number of Token Ring Beaconing Event 3
variable15      Number of Token Ring Beaconing Event 4
variable16      Number of Token Ring soft error type 1
variable17      Number of Token Ring soft error type 2
variable18      Number of Token Ring soft error type 3
variable19      Number of Token Ring soft error type 4
variable20      Number of Token Ring soft error type 5
variable21      Number of Token Ring soft error type 6
variable22      Number of Token Ring soft error type 7
variable23      Number of Token Ring soft error type 8
variable24      Number of Token Ring soft error type 9
variable25      Number of Token Ring soft error type 10
variable26      Number of Token Ring logical link frames
variable27      Number of frames inbound on an interface
variable28      Number of bytes inbound on an interface
variable29      Number of frames outbound on an interface
variable30      Number of bytes outbound on an interface

Table A-4: Column Allocations for WAN Elements 

MTF Variable    Description
variable1       Number of frames
variable2       Number of bytes (in)
variable3       Number of non-unicast frames (in)
variable4       Number of non-unicast frames (in+out)
variable5       Available for OEM use
variable6       Available for OEM use
variable7       Number of queue drops (in)
variable8       Number of queue drops (out)
variable9       Number of discarded frames (in)
variable10      Number of all errors (minus discards) (in)
variable11      Reserved
variable12      Reserved
variable13      Reserved
variable14      Reserved
variable15      Reserved
variable16      Number of unknown protocols (in)
variable17      Available for OEM use
variable18      Reserved
variable19      Reserved
variable20      Reserved
variable21      Reserved
variable22      Number of frames (in+out)
variable23      Number of bytes (in+out)
variable24      Number of all errors (minus discards) (in+out)
variable25      Number of discarded frames (in+out)
variable26      Available for OEM use
variable27      Available for OEM use
variable28      Available for use
variable29      Available for use
variable30      Available for use


Table A-5: Column Allocations for MIB2 LAN and MIB2 LAN Full Duplex Elements 

MTF Variable    Description
variable1       Number of frames (in)
variable2       Number of bytes (in)
variable3       Number of non-unicast frames (in)
variable4       Number of non-unicast frames (in+out)
variable5       Number of collisions (out)
variable6       Number of deferred frames (out)
variable7       Number of queue drops (in)
variable8       Number of queue drops (out)
variable9       Number of discarded frames (in)
variable10      Number of all errors (minus discards) (in)
variable11      Reserved
variable12      Reserved
variable13      Reserved
variable14      Reserved
variable15      Reserved
variable16      Number of unknown protocols (in)
variable17      Number of alignment errors
variable18      Reserved
variable19      Reserved
variable20      Reserved
variable21      Reserved
variable22      Number of frames (in+out)
variable23      Number of bytes (in+out)
variable24      Number of all errors (minus discards) (in+out)
variable25      Number of discarded frames (in+out)
variable26      Available for OEM use
variable27      Available for OEM use
variable28      Available for use
variable29      Available for use
variable30      Available for use


Table A-6: Column Allocations for Frame Relay Elements 

MTF Variable    Description
variable1       Reserved
variable2       Reserved
variable3       Reserved
variable4       Available for OEM use
variable5       Available for OEM use
variable6       Available for OEM use
variable7       Available for OEM use
variable8       Available for OEM use
variable9       Available for use
variable10      Number of all errors
variable11      Available for use
variable12      Number of BECNs (in)
variable13      Number of BECNs (out)
variable14      Number of FECNs (in)
variable15      Number of FECNs (out)
variable16      Number of discards
variable17      Number of discard eligible drops
variable18      Number of non-discard eligible drops
variable19      Number of drops
variable20      Number of discard eligible frames (in)
variable21      Number of discard eligible frames (out)
variable22      Number of discard eligible bytes (in)
variable23      Number of discard eligible bytes (out)
variable24      Available for use
variable25      Available for use
variable26      Available for use
variable27      Number of frames (in)
variable28      Number of bytes (in)
variable29      Number of frames (out)
variable30      Number of bytes (out)


Table A-7: Column Allocations for ATM Ports 

MTF Variable    Description
variable1       Number of cells (in)
variable2       Number of bytes (in)
variable3       Reserved
variable4       Available for OEM use
variable5       Available for OEM use
variable6       Errored seconds
variable7       Severely errored seconds
variable8       Unavailable seconds
variable9       Number of discards (in)
variable10      Number of errors (minus discards) (in)
variable11      Number of AAL5 PDUs received
variable12      Number of AAL5 PDUs transmitted
variable13      Number of AAL5 received PDUs dropped
variable14      Number of AAL5 transmitted PDUs dropped
variable15      CLP1 discards total
variable16      CLP1 discards in
variable17      CLP1 cells total
variable18      CLP1 cells in
variable19      Reserved
variable20      Reserved
variable21      Reserved
variable22      Number of cells (in+out)
variable23      Number of bytes (in+out)
variable24      Number of errors (in+out)
variable25      Number of discards (in+out)
variable26      Policy violations total
variable27      Policy violations in
variable28      Available for use
variable29      Available for use
variable30      Available for use

Table A-8: Column Allocations for ATM Paths 

MTF Variable    Description
variable1       Number of AAL5 received PDUs dropped
variable2       Number of AAL5 transmitted PDUs dropped
variable3       Number of AAL5 PDUs received
variable4       Available for OEM use
variable5       Available for OEM use
variable6       Available for OEM use
variable7       Available for OEM use
variable8       Available for OEM use
variable9       Number of AAL5 PDUs transmitted
variable10      Reserved
variable11      Reserved
variable12      Number of discards (in)
variable13      Number of discards (out)
variable14      Number of CLP1 discards total
variable15      Number of CLP1 discards in
variable16      Number of maximum channels (in)
variable17      Number of allocated channels (in)
variable18      Number of CLP1 cells total
variable19      Number of CLP1 cells in
variable20      Number of maximum channels (out)
variable21      Number of allocated channels (out)
variable22      Available for use
variable23      Available for use
variable24      Number of policy violations total
variable25      Number of policy violations in
variable26      Available for use
variable27      Number of cells (in)
variable28      Number of bytes (in)
variable29      Number of cells (out)
variable30      Number of bytes (out)


Table A-9: Column Allocations for ATM Channels 

MTF Variable    Description
variable1       Number of AAL5 received PDUs dropped
variable2       Number of AAL5 transmitted PDUs dropped
variable3       Number of AAL5 PDUs received
variable4       Available for OEM use
variable5       Available for OEM use
variable6       Available for OEM use
variable7       Available for OEM use
variable8       Available for OEM use
variable9       Number of AAL5 PDUs transmitted
variable10      Reserved
variable11      Reserved
variable12      Number of discards (in)
variable13      Number of discards (out)
variable14      Reserved
variable15      Number of CLP1 discards total
variable16      Number of CLP1 discards total (in)
variable17      Number of CLP1 cells total
variable18      Number of CLP1 cells in
variable19      Reserved
variable20      Reserved
variable21      Number of policy violations (total)
variable22      Number of policy violations (in)
variable23      Available for use
variable24      Available for use
variable25      Available for use
variable26      Available for use
variable27      Number of cells (in)
variable28      Number of bytes (in)
variable29      Number of cells (out)
variable30      Number of bytes (out)


Table A-10: Column Allocations for Routers 

MTF Variable    Description
variable1       Number of frames (in)
variable2       Number of bytes (in)
variable3       Number of non-unicast frames (in)
variable4       Average line utilization
variable5       Average discard rate
variable6       Average packet fault rate
variable7       Number of input queue drops (total)
variable8       Number of output queue drops (total)
variable9       Number of discards (in)
variable10      Number of all errors (minus discards) (in)
variable11      Number of slow packets (in)
variable12      Number of slow packets (out)
variable13      Number of fast packets (in)
variable14      Number of fast packets (out)
variable15      Number of bridged packets
variable16      Number of unknown packets
variable17      Number of IP packets
variable18      Number of DECnet packets
variable19      Number of XNS packets
variable20      Number of Appletalk packets
variable21      Number of forward IPX packets
variable22      Number of frames (total)
variable23      Number of bytes (total)
variable24      Number of errors (minus discards) (total)
variable25      Number of frames discarded (total)
variable26      Number of non-unicast frames (total)
variable27      Reserved
variable28      Reserved
variable29      Reserved
variable30      Reserved


Table A-11: Column Allocations for Router CPUs 


MTF Variable 


Description 


variable 1 


Reserved 


variable2 


Reserved 


variable3 


Reserved 


variable4 


Available for OEM use 


variables 


Available for OEM use 


variable6 


Available for OEM use 


variable7 


Available for OEM use 


variable8 


Available for OEM use 


variable9 


Reserved 


variable 10 


Available for use 


variable 11 


Number of bus drops 


variable 12 


CPU count 


variable 13 


Free memory 


variable 14 


Total number of buffers 


variable 15 


Number of buffers used 


variable 16 


Number of small buffer hits 


variable 17 


Number of small buffer misses 


variable 18 


Number of medium buffer hits 


variable 19 


Number of medium buffer misses 


variable20 


Number of big buffer hits 


variable21 


Number of big buffer misses 


variable22 


Number of large buffer hits 


variable23 


Number of large buffer misses 


variable24 


Number of huge buffer hits 


variable25 


Number of huge buffer misses 


variable26 


Available for use 


variable27 


Available for use 


variable28 


Available for use 


variable29 


Available for use 


variable30 


Number of buffer create failures 


Table A-12: Column Allocations for Servers 



MTF Variable 


Description 


variable1 


Number of page ins 


variable2 


CPU load average 


variable3 


Number of page outs 


variable4 


Number of page swap ins 


variable5 


Number of page swap outs 


variable6 


Number of file cache hits 


variable7 


Number of file cache misses 


variable8 


Total physical memory 


variable9 


Physical memory used 


variable10 


Number of page faults 


variable11 


Average CPU utilization 


variable12 


CPU imbalance 


variable13 


Number of interrupts 


variable14 


Number of active connections 


variable15 


Number of dropped connections 


variable16 


Total virtual memory 


variable17 


Virtual memory used 


variable 18 


Number of small communication buffers dropped 


variable 19 


Total number of large communication buffers 


variable20 


Number of large communication buffers used 


variable21 


Number of page scans 


variable22 


Number of system calls 


variable23 


Number of processes 


variable24 


Sum of errors (in+out) and discards 


variable25 


Sum of discards (in+out) 


variable26 


Total CPU utilization 





variable27 


Sum of packets (in) 


variable28 


Sum of octets (in) 


variable29 


Sum of packets (out) 


variable30 


Sum of octets (out) 


Table A-13: Column Allocations for Server CPUs 


MTF Variable 


Description 


variable 1 


Available for OEM use 


variable2 


Available for OEM use 


variable3 


Available for OEM use 


variable4 


Available for OEM use 


variable5 


Available for use 


variable6 


Available for use 


variable7 


Available for use 


variable8 


Available for use 


variable9 


Reserved 


variable 10 


Reserved 


variable 11 


Reserved 


variable 12 


Reserved 


variable 13 


Reserved 


variable 14 


Reserved 


variable 15 


Reserved 


variable 16 


Reserved 


variable 17 


Reserved 


variable 18 


Reserved 


variable 19 


Reserved 


variable20 


Reserved 





variable21 


Reserved 


variable22 


Reserved 


variable23 


Reserved 


variable24 


CPU utilization 


variable25 


CPU user time 


variable26 


CPU system time 


variable27 


CPU wait time 


variable28 


CPU idle time 


variable29 


Reserved 


variable30 


Reserved 


Table A-14: Column Allocations for User and System Server Partitions 


MTF Variable 


Description 


variable 1 


Inode utilization 


variable2 


Total bytes transmitted and received during the 
interval, preferably payload bytes 


variable3 


Available for OEM use 


variable4 


Available for OEM use 


variable5 


Available for use 


variable6 


Available for use 


variable 7 


Available for use 


variable8 


Available for use 


variable9 


Reserved 


variable 10 


Reserved 


variable 11 


Reserved 


variable 12 


Reserved 


variable 13 


Reserved 


variable 14 


Reserved 


variable 15 


Reserved 


variable 16 


Reserved 


variable 17 


Reserved 


variable18 


Reserved 


variable 19 


Reserved 


variable20 


Reserved 


variable21 


Reserved 


variable22 


Reserved 


variable23 


Reserved 


variable24 


Storage capacity 


variable25 


Storage used 


variable26 


Reserved 


variable27 


Partition allocation failures 


variable28 


Reads 


variable29 


Writes 


variable30 


Reads plus writes 
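The storage columns in Table A-14 lend themselves to a derived utilization figure. The following Python sketch is illustrative only: it assumes a polled row is available as a dictionary keyed by MTF variable name (a hypothetical representation, not a Network Health API), and that variable24 and variable25 are reported in the same unit.

```python
def utilization_percent(storage_used, storage_capacity):
    """Return used storage as a percentage of capacity, or None if capacity is unknown."""
    if storage_capacity <= 0:
        return None
    return 100.0 * storage_used / storage_capacity


# Hypothetical polled row for one partition; the values are made up for illustration.
row = {
    "variable24": 2000.0,  # Storage capacity
    "variable25": 1500.0,  # Storage used
    "variable28": 120,     # Reads
    "variable29": 80,      # Writes
    "variable30": 200,     # Reads plus writes
}

pct_used = utilization_percent(row["variable25"], row["variable24"])

# Sanity check implied by the table: variable30 should equal variable28 + variable29.
io_consistent = row["variable28"] + row["variable29"] == row["variable30"]
```

Returning None for a zero capacity avoids a division error when an agent does not report the column.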


Table A-15: Column Allocations for Server Disks 


MTF Variable 


Description 


variable 1 


Available for OEM use 


variable2 


Available for OEM use 


variable3 


Time spent in the busy state 


variable4 


Queue length 


variable5 


Available for use 


variable6 


Available for use 


variable7 


Reserved 


variable8 


Reserved 


variable9 


Reserved 


variable 10 


Reserved 


variable 11 


Reserved 


variable 12 


Reserved 


variable 13 


Reserved 


variable 14 


Reserved 


variable 15 


Reserved 


variable 16 


Reserved 


variable 17 


Reserved 


variable 18 


Reserved 


variable 19 


Reserved 


variable20 


Reserved 


variable21 


Reserved 


variable22 


Reserved 


variable23 


Reserved 


variable24 


Storage capacity 


variable25 


Storage used 


variable26 


Reserved 


variable27 


Faults 


variable28 


Reads 


variable29 


Writes 


variable30 


Reads plus writes 


Table A-16: Column Allocations for Server Interfaces 



MTF Variable 


Description 


variable1 


Number of frames 


variable2 


Number of bytes (in) 


variable3 


Number of non-unicast frames 


variable4 


Number of non-unicast frames (in+out) 


variable5 


Reserved 


variable6 


Reserved 


variable7 


Reserved 


variable8 


Reserved 


variable9 


Number of errors (in) 


variable10 


Number of discards (in) 


variable11 


Reserved 


variable12 


Reserved 


variable13 


Reserved 


variable14 


Reserved 


variable15 


Reserved 


variable16 


Number of unknown protocols (in) 


variable17 


Reserved 


variable18 


Outgoing queue length 


variable 19 


Reserved 


variable 20 


Reserved 


variable21 


Reserved 


variable22 


Number of frames (total) 


variable23 


Number of bytes (total) 


variable24 


Number of errors (total) 


variable25 


Number of discards (total) 


variable26 


Reserved 


variable27 


Reserved 


variable28 


Reserved 


variable29 


Reserved 


variable30 


Reserved 


Table A-17: Column Allocations for Server Process Sets 


MTF Variable 


Description 


variable1 


Reserved 


variable2 


Average CPU utilization 


variable3 


Physical memory used 


variable4 


Virtual memory used 


variable5 


Number of pages paged 


variable6 


Number of pages swapped 


variable7 


Number of disk blocks read 


variable8 


Number of disk blocks written 


variable9 


Number of incoming network messages 


variable10 


Number of outgoing network messages 


variable11 


Number of system calls 


variable12 


Number of threads 


variable13 


Number of hard page faults 


variable14 


Number of soft page faults 


variable15 


Number of swaps 


variable16 


Reserved 


variable17 


Reserved 


variable18 


Reserved 


variable19 


Reserved 


variable20 


Reserved 


variable21 


Reserved 


variable22 


Reserved 


variable23 


Reserved 


variable24 


Reserved 


variable25 


Reserved 


variable26 


Reserved 


variable27 


Reserved 


variable28 


Reserved 


variable29 


Reserved 


variable30 


Reserved 


Table A-18: Column Allocations for RAS Devices 


MTF Variable 


Description 


variable 1 


Reserved 


variable2 


Connection time in call seconds 


variable3 


Number of connect errors 


variable4 


Average CPU utilization 


variable5 


CPU imbalance 


variable6 


Number of non-connect (other) errors 


variable7 


Number of octets transmitted 


variable8 


Number of octets received 


variable9 


Number of discards 


variable 10 


Number of frame errors 


variable 11 


Total memory 


variable 12 


Memory used 


variable 13 


Number of retrains 


variable 14 


Number of frames transmitted 


variable 15 


Number of frames received 


variable 16 


Number of connections 


variable 17 


Time spent in the onhook state 


variable 18 


Time spent in the offhook state 


variable 19 


Time spent in the connected state 


variable20 


Time spent in the disabled state 


variable21 


Time spent in the unknown state 


variable22 


Time since the last successful poll 


variable23 


Modems in use/occupied 


variable24 


Number of modems/ISDN in the RAS 


variable25 


Time spent in the busy state 


variable26 


Time spent in the test state 


variable27 


Available for use 


variable28 


Available for use 


variable29 


Available for use 


variable30 


Available for use 


Table A-19: Column Allocations for Modem Pools 


MTF Variable 


Description 


variable 1 


Reserved 


variable2 


Connection time in call seconds 


variable3 


Number of connect errors 


variable4 


Reserved 


variable5 


Reserved 


variable6 


Number of non-connect (other) errors 


variable7 


Number of octets transmitted 


variable8 


Number of octets received 


variable9 


Number of discards 


variable 10 


Number of frame errors 


variable 11 


Reserved 


variable 12 


Reserved 


variable 13 


Number of retrains 


variable 14 


Number of frames transmitted 


variable 15 


Number of frames received 


variable 16 


Number of connections 


variable 17 


Time spent in the onhook state 


variable 18 


Time spent in the offhook state 


variable 19 


Time spent in the connected state 


variable20 


Time spent in the disabled state 


variable21 


Time spent in the unknown state 


variable22 


Time since the last successful poll 


variable23 


Modems in use/occupied 


variable24 


Number of modems/ISDN in the pool 


variable25 


Time spent in the busy state 


variable26 


Time spent in the test state 


variable27 


Available for use 


variable28 


Available for use 


variable29 


Available for use 


variable30 


Available for use 


Table A-20: Column Allocations for Modems 


MTF Variable 


Description 


variable 1 


Reserved 


variable2 


Connection time in call seconds 


variable3 


Number of connect errors 





variable4 


Reserved 


variable5 


Reserved 


variable6 


Number of non-connect (other) errors 


variable7 


Number of octets transmitted 


variable8 


Number of octets received 


variable9 


Number of discards 


variable 10 


Number of frame errors 


variable 11 


Call transmit rate 


variable 12 


Call receive rate 


variable 13 


Number of retrains 


variable 14 


Number of frames transmitted 


variable 15 


Number of frames received 


variable 16 


Number of connections 


variable 17 


Time spent in the onhook state 


variable 18 


Time spent in the offhook state 


variable 19 


Time spent in the connected state 


variable20 


Time spent in the disabled state 


variable21 


Time spent in the unknown state 


variable22 


Available for use 


variable23 


Occupied flag 


variable24 


Available for use 


variable25 


Time spent in the busy state 


variable26 


Time spent in the test state 


variable27 


Available for use 


variable28 


Available for use 


variable29 


Available for use 


variable30 


Available for use 


Table A-21: Column Allocations for ISDN Interfaces 



MTF Variable 


Description 


variable 1 


Reserved 


variable2 


Connection time in call seconds 


variable3 


Reserved 


variable4 


Reserved 


variable5 


Reserved 


variable6 


Reserved 


variable7 


Number of octets transmitted 


variable8 


Number of octets received 


variable9 


Number of discards 


variable 10 


Number of frame errors 


variable 11 


Call transmit rate 


variable12 


Call receive rate 


variable 13 


Reserved 


variable 14 


Number of frames transmitted 


variable 15 


Number of frames received 


variable 16 


Number of connections 


variable 17 


Time spent in the onhook state 


variable 18 


Time spent in the offhook state 


variable 19 


Time spent in the connected state 


variable20 


Time spent in the disabled state 


variable21 


Time spent in the unknown state 


variable22 


Reserved 


variable23 


Connected flag 


variable24 


Reserved 


variable25 


Time spent in the busy state 


variable26 


Time spent in the test state 


variable27 


Available for use 


variable28 


Available for use 


variable29 


Available for use 


variable30 


Available for use 


Table A-22: Column Allocations for Network Paths 


MTF Variable 


Description 


variable1 


The minimum response time in milliseconds found over 
the path during the interval 


variable2 


The maximum response time in milliseconds found 
over the path during the interval 


variable3 


The sum of the squares of total response times 


variable4 


The number of attempts to detect the round-trip time 
during the interval 


variable5 


The number of successful attempts to detect the 
round-trip time during the interval 


variable6 


Total bytes transmitted and received during the 
interval, preferably payload bytes 


variable7 


Total bytes received during the interval, preferably 
payload bytes 


variable8 


Reserved 


variable9 


Reserved 


variable 10 


Reserved 


variable 11 


Reserved 


variable 12 


Reserved 


variable 13 


Reserved 


variable 14 


Reserved 


variable 15 


Reserved 


variable 16 


Reserved 


variable 17 


Reserved 


variable 18 


Reserved 


variable 19 


Reserved 


variable20 


Available for use 


variable21 


Available for use 


variable22 


Available for use 


variable23 


Available for use 


variable24 


Available for use 


variable25 


Available for use 


variable26 


Available for use 


variable27 


Available for use 


variable28 


Available for use 


variable29 


Available for use 


variable30 


Available for use 
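The attempt and success counters in Table A-22 (variable4 and variable5) can be combined into a per-interval success rate. The Python sketch below is a minimal illustration under stated assumptions: the dictionary row and the function name are hypothetical, and Network Health's own reports may derive this figure differently.

```python
def success_percent(attempts, successes):
    """Percentage of round-trip measurement attempts that succeeded in the interval."""
    if attempts <= 0:
        return None  # no attempts were made, so no rate can be computed
    return 100.0 * successes / attempts


# Hypothetical polled row for one network path; the values are made up.
row = {
    "variable4": 12,  # attempts to detect the round-trip time
    "variable5": 9,   # successful attempts
}

path_success = success_percent(row["variable4"], row["variable5"])
```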


Table A-23: Column Allocations for Network Paths Using DNS 


MTF Variable 


Description 


variable 1 


The minimum response time in milliseconds found over 
the path during the interval 


variable2 


The maximum response time in milliseconds found 
over the path during the interval 


variable3 


Reserved 


variable4 


The number of attempts to detect the round-trip time 
during the interval 


variable5 


The number of successful attempts to detect the 
round-trip time during the interval 


variable6 


Total bytes transmitted and received during the 
interval, preferably payload bytes 


variable7 


Total bytes received during the interval, preferably 
payload bytes 


variable8 


Available for use 


variable9 


Reserved 


variable 10 


Reserved 


variable 11 


Reserved 


variable 12 


Reserved 


variable 13 


Reserved 


variable 14 


Reserved 


variable 15 


Reserved 


variable 16 


Reserved 


variable 17 


Reserved 


variable 18 


Reserved 


variable 19 


Reserved 


variable20 


Available for use 


variable21 


Available for use 


variable22 


Available for use 


variable23 


Available for use 


variable24 


Available for use 


variable25 


Available for use 


variable26 


Available for use 


variable27 


Available for use 


variable28 


Available for use 


variable29 


Available for use 


variable30 


Available for use 


Table A-24: Column Allocations for Network Paths Using HTTP 



MTF Variable 


Description 


variable 1 


The minimum response time in milliseconds found over 
the path during the interval 


variable2 


The maximum response time in milliseconds found 
over the path during the interval 


variable3 


Reserved 


variable4 


The number of attempts to detect the round-trip time 
during the interval 


variable5 


The number of successful attempts to detect the 
round-trip time during the interval 


variable6 


Total bytes transmitted and received during the 
interval, preferably payload bytes 


variable7 


Total bytes received during the interval, preferably 
payload bytes 


variable8 


The sum of the response time in milliseconds for the 
DNS transactions. Used to calculate the average 
associated DNS response time. 


variable9 


Reserved 


variable 10 


Reserved 


variable11 


Reserved 


variable12 


Reserved 


variable 13 




variable 14 


Reserved 


variable 15 


Reserved 


variable 16 


Reserved 


variable 17 


Reserved 


variable 18 


Reserved 


variable 19 


Available for use 


variable20 


Available for use 


variable21 


Available for use 


variable22 


Available for use 


variable23 


Available for use 


variable24 


Available for use 


variable25 


Available for use 


variable26 


Available for use 


variable27 


Available for use 


variable28 


Available for use 


variable29 


Available for use 


variable30 


Available for use 


Table A-25: Column Allocations for Network Paths for Voice over IP 
(Jitter) 


MTF Variable 


Description 


variable 1 


The minimum response time in milliseconds found over 
the path during the interval 


variable2 


The maximum response time in milliseconds found 
over the path during the interval 


variable3 


Reserved 


variable4 


The number of attempts to detect the round-trip time 
during the interval 


variable5 


The number of successful attempts to detect the 
round-trip time during the interval 


variable6 


Total bytes transmitted and received during the 
interval, preferably payload bytes 


variable7 


Total bytes received during the interval, preferably 
payload bytes 


variable8 


The sum of all jitter measurements from source to 
destination 


variable9 


The sum of all negative jitter measurements from source 
to destination 


variable10 


The sum of all jitter measurements from destination to 
source 


variable11 


The sum of all negative jitter measurements from 
destination to source 


variable 12 


The maximum positive jitter measurement from source 
to destination during the interval 


variable 13 


The maximum negative jitter measurement (absolute 
value) from source to destination during the interval 


variable 14 


The maximum positive jitter measurement from 
destination to source 


variable 15 


The maximum negative jitter measurement (absolute 
value) from destination to source 


variable 16 


Reserved 


variable 17 


Reserved 


variable 18 


Reserved 


variable 19 


Reserved 


variable20 


The total number of jitter measurements from source to 
destination 


variable21 


The total number of positive jitter measurements from 
source to destination 


variable22 


The total number of negative jitter measurements from 
source to destination 


variable23 


The total number of jitter measurements from 
destination to source 


variable24 


The total number of positive jitter measurements from 
destination to source 


variable25 


The total number of negative jitter measurements from 
destination to source 


variable 26 


Available for use 


variable 27 


Available for use 


variable 28 


Available for use 


variable 29 


Available for use 


variable 30 


Available for use 
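Table A-25 stores jitter as running sums (variable8, variable10) alongside measurement counts (variable20, variable23), so an average per direction can be derived by division. The Python sketch below shows that arithmetic only; the dictionary layout and helper name are assumptions made for illustration, and the sample values are invented.

```python
def mean_or_none(total, count):
    """Divide a running sum by its measurement count, guarding against zero counts."""
    return total / count if count else None


# Hypothetical polled row for one Voice over IP path.
row = {
    "variable8": 120.0,   # sum of all jitter measurements, source to destination
    "variable20": 40,     # total number of jitter measurements, source to destination
    "variable10": 90.0,   # sum of all jitter measurements, destination to source
    "variable23": 30,     # total number of jitter measurements, destination to source
}

avg_jitter_src_to_dst = mean_or_none(row["variable8"], row["variable20"])
avg_jitter_dst_to_src = mean_or_none(row["variable10"], row["variable23"])
```

Storing sums and counts rather than averages lets values from several polling intervals be combined without losing accuracy: add the sums, add the counts, then divide once.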


Table A-26: Column Allocations for Application Paths for FirstSense 


MTF Variable 


Description 


variable 1 


Reserved 


variable2 


Reserved 


variable3 


The sum of the squares of total response times 


variable4 


The number of attempts to detect the round-trip time 
during the interval 


variable5 


The number of successful attempts to detect the 
round-trip time during the interval 


variable6 


Total bytes transmitted and received during the 
interval, preferably payload bytes 


variable7 


Total bytes received during the interval, preferably 
payload bytes 


variable8 


The sum of connect response times in milliseconds 


variable9 


The number of connect attempts 


variable 10 


The number of connect successes 


variable 11 


Reserved 


variable 12 


Reserved 


variable 13 


Reserved 


variable 14 


Reserved 


variable 15 


Reserved 


variable 16 


Reserved 


variable 17 


Reserved 


variable18 


The total client time 


variable19 


The total server time 


variable20 


Available for use 


variable21 


Available for use 


variable22 


Available for use 


variable23 


Available for use 


variable24 


Available for use 


variable25 


Available for use 


variable26 


Available for use 


variable27 


Available for use 


variable28 


Available for use 


variable29 


Available for use 


variable30 


Available for use 


Table A-27: Column Allocations for Network Paths for Empire Service 
Response 


MTF Variable 


Description 


variable1 


The minimum total response time in milliseconds found 
over the path during the interval 


variable2 


The maximum total response time in milliseconds 
found over the path during the interval 


variable3 


Reserved 


variable4 


The number of attempts to detect the round-trip time 
during the interval 


variable5 


The number of successful attempts to detect the 
round-trip time during the interval 


variable6 


Reserved for total bytes transmitted and received 
during the interval, preferably payload bytes 


variable7 


Reserved for total bytes received during the interval, 
preferably payload bytes 


variable8 


The sum of connect response times in milliseconds 


variable9 


Reserved 


variable 10 


Reserved 


variable11 


The minimum connect response time in milliseconds 


variable12 


The maximum connect response time in milliseconds 


variable13 


variable14 


The minimum name lookup response time in 
milliseconds 


variable15 


The maximum name lookup response time in 
milliseconds 

variable 16 


Reserved 


variable 17 


Reserved 


variable 18 


Reserved 


variable 19 


Reserved 


variable20 


The sum of the response times in milliseconds for the 
Empire Service Response transactions 


variable21 


The minimum transaction response time in 
milliseconds 


variable22 


The maximum transaction response time in 
milliseconds 


variable23 


Available for use 


variable24 


Available for use 


variable25 


Available for use 


variable26 


Available for use 


variable27 


Available for use 


variable28 


Available for use 


variable29 


Available for use 


variable30 


Available for use 


1 Introduction 

This document describes the default profiles shipped with LiveExceptions in eHealth 4.7. It also describes the 
alarms they raise and the actions you can take to identify and correct the underlying problems. 

For a general overview of LiveExceptions, please read the white paper "An Introduction to LiveExceptions" before 
reading this document. 

1.1 Audience 

The primary audience for this document is the Network Operations Center (NOC) staff responsible for 
resolving performance and failure problems in the networks, systems, and applications being monitored. For you, the 
document describes each of the alarms raised by the default profiles provided in the product. 

You should be familiar with the Trend, At-a-Glance (AAG), and TopN reports provided in eHealth, and with using the 
Web User Interface to run and view those reports. You should also be familiar with using the LiveTrend user 
interface to monitor trend variables in real time. This document describes each alarm and recommends possible 
actions you can take to diagnose and repair the problem. 

The secondary audience for this document is the LiveExceptions administrator, who sets up profiles and applies them 
to groups (or group lists) of elements as subjects for LiveExceptions to monitor. 

For you, this document describes in detail the alarm rules that make up each profile. It describes the kinds of 
elements to which each profile should be applied. It also forms a base from which you can develop your own rules 
and profiles, and it describes some techniques for developing good rules that minimize false alarms. 

1.2 Profiles, Alarm Rules, and Technologies 

Each profile described below defines a collection of alarm rules that apply to a particular technology and detect 
particular kinds of problems. The technology to which a profile applies corresponds to a group technology. The 
technology is sometimes refined to apply to more specific kinds of elements. For example, the WAN delay profiles 
apply only to WAN ports, not to the ATM or Frame Relay circuits that might be carried over them. Further, they 
differ based on the link speed: faster links can sustain a higher utilization than slower links. The kinds of profiles 
and the problems they detect include: 

• Delay profiles, which raise an alarm when an element is contributing to delay, either because it is overutilized 
or because congestion is detected. 

• Failure profiles, which raise an alarm when the element is down. They also raise an alarm if the element is 
suffering too many errors (and thus has effectively failed), or if it is in danger of failing, perhaps because it is 
running out of some key resource, such as inodes on a Unix partition. 

• Unusual workload profiles, which raise an alarm if the workload presented to an element, or the work done by 
an element, is unusual when compared against a historical baseline. 

• Host latency profiles, which raise an alarm if the latency to a host is unusually high, or beyond any reasonable 
limit. 

• Response profiles, which raise an alarm if response time problems are detected. 

Each profile is described in a separate table, with an entry in the table for each alarm rule (or set of closely related 
rules). Each table lists the algorithm used, the variables examined, any thresholds and parameters used in 
the rules, the window examined, the severity of the alarm, a description of what the alarm means, and recommended 
steps you should take to diagnose and repair the problem detected. 

1.3 Alarm Rule Algorithms 

LiveExceptions includes a family of algorithms that detect problems. These algorithms are implemented in the 
LiveExceptions Server, a background process that monitors the data collected by eHealth. These algorithms are 
invoked by alarm rules that are written in profiles. The profiles are applied to specific groups of elements, which 
tells the LiveExceptions Server what to watch and what alarms to raise. Alarm rules indicate the 
problem detection algorithm to use, the element types and variables to watch, and the parameters that control the 
algorithm, such as thresholds, windows, and baselines. 

Each algorithm is described in a section below. Within the tables that describe the profiles, each alarm rule refers to 
the algorithm(s) used by an abbreviation. The abbreviation for each algorithm is given below. 

1.4 Time Over Threshold (TOT) 

The time over threshold algorithm measures a variable against a fixed threshold on each poll period. It remembers 
the results over the recent past and measures how much time the variable was above (or below) the threshold. The 
period of time the algorithm looks back is called the "window", and is typically an hour. 

An example of a rule using the TOT algorithm, as written in the tables below, is: 
Rule: (TOT, Bandwidth Utilization > 60%) 

Window: 15/60 min 

Parameters for the Time over threshold algorithm are:

Parameter           Description

Variable            The trend variable examined. In the example, Bandwidth Utilization.

Threshold           The value compared against. In the example, > 60%.

Analysis window     All of the samples collected during the analysis window (from the current sample time
                    back) are examined. In the example, 60 minutes.

Condition window    The amount of time the condition must be true to raise the alarm. In the example, 15
                    minutes.
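
The time over threshold check can be sketched in Python as follows. This is an illustration only, not the LiveExceptions Server implementation: the function name time_over_threshold, the (time, value) sample format, and the assumption that each sample accounts for one full poll period are all hypothetical.

```python
from typing import Iterable, Tuple


def time_over_threshold(
    samples: Iterable[Tuple[float, float]],
    now: float,
    threshold: float,
    analysis_window_min: float = 60.0,
    condition_window_min: float = 15.0,
    poll_period_min: float = 5.0,
    above: bool = True,
) -> bool:
    """Return True if the threshold was violated for at least
    condition_window_min minutes within the last analysis_window_min minutes.

    samples are (time_in_minutes, value) pairs. Each sample is assumed to
    stand for one poll period (an assumption of this sketch).
    """
    violating_minutes = 0.0
    for t, value in samples:
        if not (now - analysis_window_min < t <= now):
            continue  # sample lies outside the analysis window
        violated = value > threshold if above else value < threshold
        if violated:
            violating_minutes += poll_period_min
    return violating_minutes >= condition_window_min


# Rule: (TOT, Bandwidth Utilization > 60%), Window: 15/60 min
# Twelve 5-minute samples; the last four (20 minutes) are at 70%.
samples = [(t, 70.0 if t >= 45 else 30.0) for t in range(5, 65, 5)]
alarm = time_over_threshold(samples, now=60, threshold=60.0)
print(alarm)
```

With the sample data shown, the variable is over the threshold for 20 of the last 60 minutes, which exceeds the 15-minute condition window.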



1.5 Time Over Dynamic Threshold (TODT) 

This algorithm compares the value of a trend variable against a dynamically computed threshold. Like the Time over 
threshold algorithm it compares the recent samples within the window against the threshold. If enough samples are 
above (or below) the threshold, an alarm is raised. We measure the window and duration as monitored times, not as 
numbers of samples. 

The threshold is computed dynamically and is set far enough below a limit that the variable is unlikely to
exceed the limit soon. An example is partition space. If the partition becomes full, programs
won't be able to write files, and the system may come to a halt. The system manager wants an alarm to be raised 
when the partition is nearly full. But when is the partition nearly full? 

The TODT algorithm determines when a partition is nearly full by looking at recent history over a baseline period of 
the past few weeks. The algorithm determines how much the partition utilization typically grows and shrinks over 
that period. It computes the variation seen in a trend variable over the entire baseline. Variation in a variable is 
measured using a statistic called the standard deviation. From this standard deviation, the algorithm computes how 
much room should be left free. This computation uses a percentile value specified in the rule. The larger the
percentile, the larger the variation left free. This variation is then subtracted from the limit to determine the dynamic 
threshold. 

An example of a rule using the TODT rule as written in the tables below is 
Rule: (TODT, Partition Utilization > 95th percentile below 100%)

Window: 5/60 min 

Baseline: 2 weeks 



Parameters for the TODT algorithm are:

Parameter           Description

Variable            In the example, Partition Utilization.

Limit               In the example, 100%.

Percentile          In the example, 95%.

Baseline            In the example, 2 weeks.

Analysis window     In the example, 60 minutes.

Condition window    In the example, 15 minutes.



In the example, consider a 100 Mbyte partition whose space used has followed a very simple pattern. The partition 
starts at midnight 25% full. Every day, at midnight, a program runs which creates a 15 Mbyte temporary file, 
increasing the partition space utilization to 40%. Every day at noon, another program comes and deletes that file, 
returning the partition space utilization to 25% full. 

If this pattern persists through the entire baseline, it is fairly easy to compute that the standard deviation is 10.6%.
Using a percentile of 95%, that corresponds to a predicted variation of about 17.5%, which means the dynamic
threshold becomes 82.5%. As long as the partition space utilization stays below that figure, no alarm is
raised.

Now suppose one afternoon, someone creates a 50 Mbyte file on the disk. Partition space utilization increases to 
75%, and all seems well. At midnight, the temporary file is created, partition utilization rises to 90%, and an alarm is 
raised. 
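
The arithmetic in this example can be reproduced with a short Python sketch. The text does not say how the percentile is converted into a variation. The sketch assumes a normal distribution (a one-sided z-score) and takes the sample standard deviation of the two utilization levels, 25% and 40%, because that combination reproduces the quoted figures. The function name dynamic_threshold is hypothetical.

```python
from statistics import NormalDist, stdev


def dynamic_threshold(levels, limit=100.0, percentile=0.95):
    """Return (standard deviation, variation, dynamic threshold).

    Assumption of this sketch: variation = z(percentile) * standard deviation,
    where z comes from the standard normal distribution.
    """
    sd = stdev(levels)                     # sample standard deviation
    z = NormalDist().inv_cdf(percentile)   # about 1.645 for 95%
    variation = z * sd
    return sd, variation, limit - variation


# Partition alternates between 25% and 40% full over the baseline.
sd, variation, threshold = dynamic_threshold([25.0, 40.0])
print(f"std dev {sd:.1f}%, variation {variation:.1f}%, threshold {threshold:.1f}%")

# After the 50 Mbyte file is created: 75% in the afternoon, 90% at midnight.
alarm_afternoon = 75.0 > threshold
alarm_midnight = 90.0 > threshold
```

Under these assumptions the standard deviation is about 10.6%, the variation is close to the 17.5% quoted above, and only the 90% reading exceeds the dynamic threshold.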

See section 12.2 for more information on statistics used in LiveExceptions. 

1.6 Deviation from Normal Algorithms

Three closely related algorithms compare the value of a trend variable against its normal range of values. The 
normal values are computed over a baseline period (typically 6 weeks) for each hour and for each day of the week. 
The baseline calculation determines the mean (average value) of the variable. It also computes a statistical measure 
of how much the variable varies, called the standard deviation. From this information, the deviation from normal 
algorithm can use one of three techniques for determining whether the value is normal: 

• absolute from mean 

• percentage from mean 

• deviation from mean 

All three algorithms can detect if the current value is above, below, or outside (either above or below) the normal 
range. They all use the Time Over Threshold window to reduce noise; that is, they only raise an alarm if the value is
above, below, or outside the normal range for more than the condition window, out of the analysis window.

1.7 Absolute From Mean (AFM)

Absolute from mean detects when the value is a fixed amount above or below the mean. This technique is most 
useful for detecting when a value has changed from a fixed or stable configuration. For example, this could be used 
to detect when a file system has been reconfigured and changes capacity. 

An example of a rule using the AFM rule as written in the tables below is 
Rule: (AFM, Total Buffers 10 buffers below mean)

Window: 15/60 min 

Parameters for the AFM algorithm are:

Parameter             Description

Variable              In the example, Total Buffers.

Direction             In the example, below the normal range.

Absolute deviation    In the example, 10 buffers.

Baseline              The length of the baseline history used to compute the mean.

Analysis window       In the example, 60 minutes.

Condition window      In the example, 15 minutes.



1.8 Percent From Mean (PFM) 

Percentage from mean detects when the value is above the mean by a percentage. For example, 100% above the 
mean detects when the value is twice the mean value. This technique is useful for detecting large changes in a value, 
in proportion to the average value. 

An example of a rule using the PFM rule as written in the tables below is 
Rule: (PFM, Broadcasts above 100% of mean) 

Window: 15/60 min 



Parameters for the PFM algorithm are:

Parameter               Description

Variable                In the example, Broadcasts.

Direction               In the example, above the normal range.

Percentage deviation    The percentage of the mean added to (or subtracted from, if below) the mean to
                        establish what is normal.

Baseline                The length of the baseline history used to compute the mean.

Analysis window         All of the samples collected during the analysis window (from the current sample time
                        back) are examined. In the example, 60 minutes.

Condition window        The amount of time the condition must be true to raise the alarm. In the example, 15
                        minutes.

1.9 Deviation From Mean (DFM)

Deviation from mean detects when the value is above the mean by a dynamic percentile. The percentile threshold is computed
dynamically based on the standard deviation. The higher the percentile, the further from the mean the value must be
to raise the alarm. Deviation from mean dynamically determines both the mean and variation of the data. It adapts to 
cases where the mean changes, but the trend variable stays very close to the mean (a small standard deviation), and 
also to cases when the mean remains the same, but the variation from the mean is wide. Most of the rules in the 
unusual workload default profiles use the deviation from mean algorithm, often combined with the percentage from 
mean algorithm to eliminate small divergences from normal. This is described further in section 12.1. 

An example of a rule using the DFM rule as written in the tables below is
Rule: (DFM, Users above 99th %-tile)

Window: 15/60 min



Parameters for the DFM algorithm are:

Parameter           Description

Variable            In the example, Users.

Direction           In the example, above the normal range.

Percentile          In the example, 99th percentile. Refer to section 12.2 for a longer description of
                    percentiles and standard deviations.

Baseline            The length of the baseline history used to compute the mean.

Analysis window     In the example, 60 minutes.

Condition window    In the example, 15 minutes.
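
The three deviation from normal techniques differ only in how the normal range around the mean is built. The Python sketch below shows one possible per-sample check for each. The function names are hypothetical, and the DFM range assumes a normal distribution (z-score times standard deviation), which this section does not state. The result of each check would then be fed through the Time Over Threshold window before an alarm is raised.

```python
from statistics import NormalDist


def _violates(value, lower, upper, direction):
    """Compare a value with the normal range [lower, upper]."""
    if direction == "above":
        return value > upper
    if direction == "below":
        return value < lower
    if direction == "outside":
        return value > upper or value < lower
    raise ValueError(f"unknown direction: {direction}")


def afm(value, mean, amount, direction):
    """Absolute from mean: range is the mean plus or minus a fixed amount."""
    return _violates(value, mean - amount, mean + amount, direction)


def pfm(value, mean, percent, direction):
    """Percentage from mean: range is the mean plus or minus a percentage of it."""
    delta = abs(mean) * percent / 100.0
    return _violates(value, mean - delta, mean + delta, direction)


def dfm(value, mean, std_dev, percentile, direction):
    """Deviation from mean: range width derived from the standard deviation.

    Assumption of this sketch: width = z(percentile) * std_dev.
    """
    delta = NormalDist().inv_cdf(percentile / 100.0) * std_dev
    return _violates(value, mean - delta, mean + delta, direction)


# (AFM, Total Buffers 10 buffers below mean): mean 100, current value 88
print(afm(88, 100, 10, "below"))
# (PFM, Broadcasts above 100% of mean): mean 200, current value 450
print(pfm(450, 200, 100, "above"))
# (DFM, Users above 99th %-tile): mean 100, standard deviation 10, value 130
print(dfm(130, 100, 10, 99, "above"))
```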



1.10 Availability (Avail) 

The availability algorithm detects when an element is unavailable. The alarm will be cleared once eHealth sees that 
the element has been up for at least the length of the window defined in the alarm rule. The purpose of the window 
is to raise a single alarm when an element is "bouncing" up and down repeatedly. 

For hosts, routers, switches, servers, and remote access servers (RAS), when the host goes down, eHealth will be 
unable to ping or poll the host's agent. This will be seen as a Reachability problem first (see section 1.11 below). 
Later, when the host reboots and comes back up, eHealth will be able to ping and poll the host's agent. It will see 
that the host had rebooted, and was down, and will raise an alarm at that time. 

When the child elements within hosts (LAN and WAN interfaces, modems, ISDN, CPUs, disks, partitions, processes,
process sets, and response paths) go down, the host's agent may remain up and can be pinged and polled. In
those cases, eHealth can detect that the child has gone down when it polls the element, and raise an alarm
immediately.

Parameters used in the algorithm:

Parameter              Description

Availability window    The availability alarm will be active if the element has gone down at any time during
                       the window. It will only clear when the element has been up for the entire window. The
                       alarm will be raised for at least one poll period.

1.11 Reachability (Reach)

The reachability algorithm detects when a ping of an element's agent IP address fails. 

For hosts, when the host goes down, the agent address stops responding to pings and a reachability alarm is 
immediately raised for the host. The normal sequence of events when a host goes down is: 

1. The host goes down. 

2. eHealth pings the host's agent IP address, and the ping times out. eHealth retries the ping. When all the tries time
out, the ping fails and a Host Unreachable alarm is raised. 

3. Eventually, the host reboots and comes back online. 

4. eHealth pings the host's agent IP address, and the ping succeeds. eHealth then polls the host's agent and learns that
the host rebooted, and that the host was unavailable for some time, and raises a Host Down alarm. 

5. If eHealth is able to ping the host's agent IP address for a continuous time equal to the window defined in the 
rule, the reachability alarm is cleared. 

Most child elements within a host have the same agent IP address as their host parent. eHealth only pings an IP
address once, and the results of that ping are used for all the elements with the same address. All the children have 
the same reachability as their parents. The default profiles therefore do not define reachability alarm rules for 
children. Instead these are limited to parent hosts. 

Parameters for the reachability algorithm are:

Parameter              Description

Reachability window    The window determines how quickly the alarm is cleared.

                       If the element was unreachable during the window, the alarm will stay active. It only
                       clears when the element has been reachable for the entire window. The purpose of the
                       window is to raise a single alarm when an element's reachability is "bouncing" up and
                       down repeatedly.
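
The window behavior shared by the availability and reachability algorithms can be sketched in Python. The class name WindowedAlarm and its attributes are hypothetical, and the sketch models only the raise and clear logic described above, not how eHealth pings or polls.

```python
class WindowedAlarm:
    """Alarm that clears only after the element has been up for a full window."""

    def __init__(self, window_min):
        self.window_min = window_min
        self.last_failure = None   # time of the most recent failed check
        self.active = False
        self.raised_count = 0      # number of distinct alarms raised

    def observe(self, now, ok):
        """Record one check result at time `now` (minutes); return alarm state."""
        if not ok:
            self.last_failure = now
            if not self.active:
                self.active = True
                self.raised_count += 1
        elif self.active and now - self.last_failure >= self.window_min:
            self.active = False
        return self.active


# An element that "bounces": down at 0, up at 5, down at 10, then up for good.
alarm = WindowedAlarm(window_min=30)
history = []
for t in range(0, 45, 5):
    ok = t not in (0, 10)
    history.append((t, alarm.observe(t, ok)))
print(history)
print(alarm.raised_count)
```

In this run the two outages produce a single alarm, which stays active until the element has been up for the whole 30-minute window after the second outage.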

2 Ethernet Profiles

The profiles for Ethernets cover three cases: 

• Ethernet Shared, Ethernet segments built using 10base2 or 10base5 cabling systems (Thin Wire and Thick
Wire Ethernet cables), or shared 10baseT or 100baseT Ethernets built around shared hubs.

• Ethernet Dedicated Full Duplex Switch Port, segments with only two stations. This profile is appropriate for 
dedicated Ethernet segments between a LAN switch and a device (router, system, or another switch). It is not 
appropriate for a segment where the switch port is connected to a hub. In that case, use the Delay - Shared 
Ethernet profile. 

• Ethernet Dedicated Half Duplex Switch Port, segments with only two stations, operating in Half Duplex
mode. This is most often seen as the Ethernet segment between a LAN switch and a device (router, system, or
another switch) where the switch port is set to operate in Half Duplex mode.

The LAN element types used in eHealth are: 

• Ethernet, which most often is a shared Ethernet.

• MIB2LAN, which can fall into any of the above three cases.

• MIB2LAN Full Duplex, which can only be an Ethernet Full Duplex Switch Port.

2.1 Ethernet - Delay Profiles 

For Ethernet elements, Concord provides the following delay profiles: 

• Shared Ethernet - Delay, see Table 1.

• Ethernet Dedicated Half Duplex Switch Port - Delay, see Table 2. 

• Ethernet Dedicated Full Duplex Switch Port - Delay, see Table 3. 

Bandwidth Utilization and Ethernet Elements

This section describes Bandwidth Utilization In and Out variables and how their actual implementation depends 
upon the element's agent. It is useful for understanding Too Many Discards Out and Over Utilized In/Out messages. 

For an Ethernet element, Bandwidth Utilization In and Bandwidth Utilization Out are based on the Bytes In and 
Bytes Out on this interface. The total Bandwidth Utilization is based on the total Bytes for the Ethernet segment, 
which is either all the bytes seen on the wire, or simply the sum of Bytes In and Bytes Out on the interface. Which is 
true depends on the agent and what it measures, and on the MTF used to poll the agent as shown in the table below: 


If the Agent Implements ...                        Bandwidth Utilization is Based On ...

RMON etherStats table defined in RFC 1271 or       Promiscuous mode counts every frame and byte on the
RFC 1757, and places the MAC interface into        wire. The Bandwidth Utilization is the total utilization
promiscuous mode                                   on the wire.

MIB2 and the dot3 extensions in RFC 1398 or        Bandwidth Utilization is based on Bytes In + Bytes Out.
RFC 1623 or the SMIv2 version RFC 1650

Agents in Hubs typically implement proprietary     The agents generally count all the bytes and frames on
MIBs                                               the wire.
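
As an illustration of the Bytes In + Bytes Out case, the Python sketch below converts byte counts for one poll period into utilization percentages. The formula (bits transferred divided by the bits the link could carry in the interval) is a common convention and an assumption here. This document does not give the exact calculation eHealth uses, and the function name and sample numbers are hypothetical.

```python
def utilization_percent(byte_count, interval_seconds, speed_bits_per_second):
    """Percentage of link capacity used by byte_count bytes in one poll interval."""
    if interval_seconds <= 0 or speed_bits_per_second <= 0:
        raise ValueError("interval and speed must be positive")
    return 100.0 * (byte_count * 8) / (interval_seconds * speed_bits_per_second)


# Hypothetical 10 Mbit/s Ethernet interface polled every 300 seconds.
bytes_in, bytes_out = 75_000_000, 37_500_000
interval, speed = 300, 10_000_000

util_in = utilization_percent(bytes_in, interval, speed)
util_out = utilization_percent(bytes_out, interval, speed)
util_total = utilization_percent(bytes_in + bytes_out, interval, speed)
print(util_in, util_out, util_total)
```

With these numbers the interface is 20% utilized inbound and 10% outbound, so a total based on Bytes In + Bytes Out is 30%.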

Table 1 Ethernet Shared - Delay

Element Type         Rule, Trend Variable, Threshold             Window       Severity

Ethernet, MIB2LAN    TOT, Bandwidth Utilization (%) > 40%        15/60 min    Minor

Message: Over Utilized

Description: The bandwidth utilization is too high for a shared Ethernet segment. High Utilization
may lead to too many collisions and a loss of efficiency on the LAN. It can also lead to
queuing delays in the stations on the LAN segment, as the station must wait to send
frames on the LAN, and other waiting frames are delayed awaiting their turn.

Recommendations:

• Upgrade the Ethernet LAN to a higher speed (100Mbit or 1Gbit).

• Reduce the number of stations on the LAN segment. One way to do this is to split
the shared LAN into multiple segments using a switch or bridge.

• Replace hubs with switches.

• Remove traffic from the LAN, for example, move a server to a switch port.

• Check if the segment really is shared; perhaps it really is a switch port.

Ethernet, MIB2LAN    TOT, Bandwidth Utilization In (%) > 30%     15/60 min    Minor
                     TOT, Bandwidth Utilization Out (%) > 30%

Message: Over Utilized In
         Over Utilized Out

Description: The bandwidth utilization in or the bandwidth utilization out on this interface¹ is too
high for a shared Ethernet segment. High Utilization may lead to too many collisions
and a loss of efficiency on the LAN. It can also lead to queuing delays in the stations
on the LAN segment, as the station must wait to send frames on the LAN, and other
waiting frames are delayed awaiting their turn.

Recommendations:

• Upgrade the Ethernet LAN to a higher speed (100Mbit or 1Gbit).

• Reduce the number of stations on the LAN segment. One way to do this is to split
the shared LAN into multiple segments using a switch or bridge.

• Replace hubs with switches.

• Remove traffic from the LAN, for example, move a server to a switch port.

• Check if the segment really is shared; perhaps it really is a switch port.

Ethernet, MIB2LAN    TOT, Collisions (%) > 15%                   15/60 min    Minor

Message: Collisions too high

Description: When multiple stations try to send on an Ethernet LAN segment at the same time, they
may collide. Ethernets use collisions to decide who will send first, using a technique
called CSMA/CD. Thus, on a shared Ethernet, collisions are a normal occurrence.
However, too many collisions lead to a loss of efficiency, as the time spent resolving a
collision and deciding which station can send uses time and bandwidth. Also note that
certain failures in Ethernet end stations can cause excessive collisions. For example, if
the collision detection circuit fails, then the station will continue to send after a
collision, which in turn causes more collisions.

Recommendations: Same as for Bandwidth Over Utilized.

¹ For an Ethernet element, Bandwidth Utilization In and Bandwidth Utilization Out are based on the Bytes In and
Bytes Out on this interface. The total Bandwidth Utilization is based on the total Bytes for the Ethernet segment,
which is either all the bytes seen on the wire, or simply the sum of Bytes In and Bytes Out on the interface. Which is
true depends on the agent and what it measures, and on the MTF used to poll the agent. In general, if the agent
implements the RMON etherStats table defined in RFC 1271 or its replacement RFC 1757, and places the MAC
interface into promiscuous mode to count every frame and byte on the wire, the Bandwidth Utilization is the total
utilization on the wire. If the agent simply implements MIB2 and the dot3 extensions in RFC 1398, its replacement
RFC 1623, or the SMIv2 version RFC 1650, then Bandwidth Utilization is based on Bytes In + Bytes Out. Agents in
Hubs typically implement proprietary MIBs which generally count all the bytes and frames on the wire.



Ethernet, MIB2LAN    (DFM, Broadcasts > 99.9 percentile) AND     15/60 min    Minor
                     (TOT, Broadcasts > 200 frames/sec)

Message: Broadcast Storm

Description: Under certain conditions, the higher layer protocols using the LAN can generate too
many broadcast frames. Broadcast and multicast frames pass through switches and
bridges. Every station on the extended LAN must handle broadcast frames, and thus
too many broadcast frames can have a significant impact on each station attached to
the extended LAN.

Note: An extended LAN is the entire collection of Ethernet segments interconnected by
bridges and switches. Extended LANs are also called broadcast domains.

Recommendations:

• Determine the specific protocol or protocols causing the storm. For example, run
a Traffic Accountant report on protocols for a probe attached to the extended
LAN.

• Once the protocol generating too many broadcasts is identified, determine the
reason why so many broadcasts are being sent, and correct it.

• Replace a switch at the top of the switch hierarchy with a router to separate
broadcast domains.



MIB2LAN, Ethernet    TOT, Discards Out % > 1%                    15/60 min    Warning

Message: Too many discards out

Description: This alarm will be raised only for LAN elements that collect interface statistics,
MIB2LAN, and some Ethernet agents. When an interface queue
grows, eventually the router, host, or switch will run out of buffers to hold the queued
frames, and any additional frames that should be sent out the interface will be
discarded. Discards are normal in IP networks because the TCP protocol is designed to
drive the bottleneck link to saturation. The resulting congestion is then signaled back
to the TCP sender as discarded (lost) packets. Too many discards lower the overall
network efficiency, as the discarded packets must be resent.

Recommendations: While most discards are due to queuing discards, there are other reasons a router may
discard packets. Depending on the device, see if any of these other reasons may be
causing discards:

• If the link is over utilized, deal with it as described above in the discussion of the
Over Utilized alarm. Note this may only move the bottleneck to another link.
After increasing the speed, look to see if other links in the path are now seeing too
many discards or are now over utilized.

• Increase the number of buffers in the output queue. This is only appropriate if the
link is not causing delay in the network, but is still discarding packets. If the link
is causing significant delay, adding buffers can make it worse, without decreasing
the discard rate significantly.

• Implement RED (Random Early Discards) on the link. RED is a technique
supported by many routers and switches to signal congestion to TCP flows before
the queue fills. This has proven extremely effective in lowering discards, and
improving overall network performance. However, if most of the traffic is based
on UDP or protocols other than TCP/IP protocols, RED may not affect it.

Table 2 Ethernet Half Duplex Switch Port - Delay

Element Type         Rule, Trend Variable, Threshold             Window       Severity

Ethernet, MIB2LAN    TOT, Bandwidth Utilization (%) > 60%        15/60 min    Minor

Message: Over Utilized

Description: See the discussion for Shared Ethernet above. Because only two stations are on this
segment (the switch port and the station), the bandwidth utilization can be much higher
than for a shared LAN.

Recommendations:

• Upgrade the Ethernet LAN to a higher speed (100Mbit or 1Gbit).

• Switch to a Full duplex switch port and station.

• Remove traffic from the LAN, for example, if a web server is attached to the
switch port, add a second server, and split the requests equally over the pair.

Ethernet, MIB2LAN    TOT, Bandwidth Utilization In (%) > 50%     15/60 min    Minor
                     TOT, Bandwidth Utilization Out (%) > 50%

Message: Over Utilized In
         Over Utilized Out

Description: The bandwidth utilization in or the bandwidth utilization out on this interface
is too high for an Ethernet switch port. High Utilization may lead to too
many collisions and a loss of efficiency on the LAN. It can also lead to queuing delays
in the stations on the LAN segment, as the station must wait to send frames on the
LAN, and other waiting frames are delayed awaiting their turn.

While there are only two stations on the LAN, because the stations are operating in
Half Duplex mode, they can still collide.

Recommendations:

• Upgrade the Ethernet LAN to a higher speed (100Mbit or 1Gbit).

• Remove traffic from the LAN, for example, split the workload to a server across
two servers.


Ethernet, MIB2LAN    TOT, Collisions (%) > 15%                   15/60 min    Minor

Message: Too many collisions

Description: See the discussion on "Collisions Too High" for Ethernet Shared above. With only two
stations on the LAN segment, the chances of a collision are significantly reduced, and
thus more traffic can be sent and received on the LAN before collisions reduce the
efficiency.

Recommendations: Same as for bandwidth over utilized.

Ethernet, MIB2LAN    (DFM, Broadcasts > 99.9 percentile) AND     15/60 min    Minor
                     (TOT, Broadcasts > 200 frames/sec)

Message: Broadcast Storm

Description: Under certain conditions, the higher layer protocols using the LAN can generate too
many broadcast frames. Broadcast and multicast frames pass through switches and
bridges. Every station on the extended LAN must handle broadcast frames, and thus
too many broadcast frames can have a significant impact on each station attached to
the extended LAN.

Note: An extended LAN is the entire collection of Ethernet segments interconnected by
bridges and switches. Extended LANs are also called broadcast domains.

Recommendations:

• Determine the specific protocol or protocols causing the storm, for example by
running a Traffic Accountant report on protocols for a probe attached to the
extended LAN.

• Once the protocol generating too many broadcasts is identified, determine the
reason why so many broadcasts are being sent, and correct it.

• Routers can be used to separate broadcast domains, so replace a switch at the top
of the switch hierarchy with a router.

MIB2LAN, Ethernet    TOT, Discards Out % > 1%                    15/60 min    Warning

Message: Too many discards out

Description: This alarm will be raised only for LAN elements that collect interface statistics,
MIB2LAN, and some Ethernet agents. When an interface queue
grows, eventually the router, host, or switch will run out of buffers to hold the queued
frames, and any additional frames that should be sent out the interface will be
discarded. Discards are normal in IP networks because the TCP protocol is designed to
drive the bottleneck link to saturation. The resulting congestion is then signaled back
to the TCP sender as discarded (lost) packets. Too many discards lower the overall
network efficiency, as the discarded packets must be resent.

Recommendations:

• While most discards are due to queuing discards, there are other reasons a router
may discard packets. Depending on the device, see if any of these other reasons
may be causing discards.

• If the link is over utilized, deal with it as described above. Note this may only
move the bottleneck to another link. After increasing the speed, look to see if
other links in the path are now seeing too many discards or are now over utilized.

• Increase the number of buffers in the output queue. This is only appropriate if the
link is not causing delay in the network, but is still discarding packets. If the link
is causing significant delay, adding buffers can make it worse, without decreasing
the discard rate significantly.

• Implement RED (Random Early Discards) on the link. RED is a technique
supported by many routers and switches to signal congestion to TCP flows before
the queue fills. This has proven extremely effective in lowering discards, and
improving overall network performance. However, if most of the traffic is based
on UDP, or protocols other than TCP/IP protocols, RED may not affect it.

Table 3 Ethernet Full Duplex Switch Port - Delay

Element Type            Rule, Trend Variable, Threshold             Window       Severity

Ethernet                TOT, Bandwidth Utilization (%) > 90%        15/60 min    Minor

Message: Over Utilized

Description: Refer to the discussion for Shared Ethernet above. Because only two stations are on
this segment (the switch port and the station), and because the link operates in full
duplex, the bandwidth utilization can be much higher than for a shared LAN or even a
dedicated, half duplex LAN.

Recommendations:

• Upgrade the Ethernet LAN to a higher speed (100Mbit or 1Gbit).

• Remove traffic from the LAN, for example, if a web server is attached to the
switch port, add a second server, and split the requests equally over the pair.

MIB2LAN Full Duplex,    TOT, Bandwidth Utilization In (%) > 90%     15/60 min    Minor
Ethernet                TOT, Bandwidth Utilization Out (%) > 90%

Message: Over Utilized In
         Over Utilized Out

Description: The bandwidth utilization in or the bandwidth utilization out on this interface is too
high. High Utilization may lead to queuing delays as waiting frames are delayed
awaiting their turn.

Recommendations:

• Upgrade the Ethernet LAN to a higher speed (100Mbit or 1Gbit).

• Remove traffic from the LAN, for example, split the workload on a web server
across multiple web servers.

Ethernet, MIB2LAN       (DFM, Broadcasts > 99.9 percentile) AND     15/60 min    Minor
                        (TOT, Broadcasts > 200 frames/sec)

Message: Broadcast Storm

Description: Under certain conditions, the higher layer protocols using the LAN can generate too
many broadcast frames. Broadcast and multicast frames pass through switches and
bridges. Every station on the extended LAN must handle broadcast frames, and thus
too many broadcast frames can have a significant impact on each station attached to
the extended LAN.

Note: An extended LAN is the entire collection of Ethernet segments interconnected by
bridges and switches. Extended LANs are also called broadcast domains.

Recommendations:

• Determine the specific protocol or protocols causing the storm. For example, run a
Traffic Accountant report on protocols for a probe attached to the extended LAN.

• Once the protocol generating too many broadcasts is identified, determine the
reason why so many broadcasts are being sent, and correct it.

• Routers can be used to separate broadcast domains, so replace a switch at the top
of the switch hierarchy with a router.

MIB2LAN Full Duplex,    TOT, Discards Out % > 1%                    15/60 min    Warning
Ethernet

Message: Too many discards out

Description: This alarm will be raised only for LAN elements that collect interface statistics,
MIB2LAN, and some Ethernet agents. When an interface queue grows, eventually the
router, host, or switch will run out of buffers to hold the queued frames, and any
additional frames that should be sent out the interface will be discarded. Discards are
normal in IP networks because the TCP protocol is designed to drive the bottleneck
link to saturation. The resulting congestion is then signaled back to the TCP sender as
discarded (lost) packets. Too many discards lower the overall network efficiency, as
the discarded packets must be resent.

Recommendations:

• While most discards are due to queuing discards, there are other reasons a router
may discard packets. Depending on the device, see if any of these other reasons
may be causing discards.

• If the link is over utilized, deal with it as described above. Note this may only
move the bottleneck to another link. After increasing the speed, look to see if
other links in the path are now seeing too many discards or are now over utilized.

• Increase the number of buffers in the output queue. This is only appropriate if the
link is not causing delay in the network, but is still discarding packets. If the link
is causing significant delay, adding buffers can make it worse, without decreasing
the discard rate significantly.

• Implement RED (Random Early Discards) on the link. RED is a technique
supported by many routers and switches to signal congestion to TCP flows before
the queue fills. This has proven extremely effective in lowering discards, and
improving overall network performance. However, if most of the traffic is based
on UDP, or protocols other than TCP/IP protocols, RED may not affect it.



2.2 Ethernet Failure Profiles 

For Ethernet elements, we provide the following failure profiles: 

• Shared Ethernet - Failure, see Table 4. 

• Ethernet Half Duplex Switch Port - Failure, see Table 4. 

• Ethernet Full Duplex Switch Port - Failure, see Table 5. 

See the delay profiles for a description of when it is appropriate to use these three failure profiles. 
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Table 4 Ethernet Shared Segment - Failure, and 
Ethernet Half Duplex Switch Port - Failure 



Element Type Rule, Trend Variable, Threshold Window Severity 

Ethernet, MIB2LAN Availability 30 min Critical 

Message: LAN Down 

Description: How LAN availability is measured depends on the agent monitoring the LAN: 

• If the agent is built in to a hub or repeater, then if the hub or repeater is down, the 
LAN is down. Note that some of the stations on the LAN may still be able to 
communicate, but the LAN will be partitioned. 

• If the agent is a promiscuous listening station (for example, an RMON-enabled
MAC station supporting the etherStats MIB), then the state of the station
determines the state of the LAN.

• If the agent is a station that supports the dot3 MIB defined in RFC 1284, or in
RFC 1643 and RFC 1650 that replaced it, then the station state determines the LAN
state.



Recommendations: • Drill down to an AAG report for this LAN to see if any problems led up to the
failure.

Ethernet TOT, Errors (%) > 5% 

Message: Too many errors 

Description: The percentage of frames sent on the Ethernet with errors is too high. 

Recommendations: Investigate 

• Alignment errors 

• Too many collisions

• Late collisions

• Runt (too small) frame 

• Babbling stations (stations always sending) 



Ethernet, MIB2LAN TOT, Discards In % > 1% 15/60 min Major

Message: Received Frame discards

Description: Too many frames were discarded after they were received.

A frame going through a router or switch gets processed by three processes, a 
receiving process (frames in), a forwarding process, and a sending process (frames 
out). Layer 2 forwarding (done by a switch or bridge) forwards frames, while layer 3 
forwarding (done by a router) forwards packets. The frame can be discarded (lost) in 
any of the three processes. 

Frames lost in sending are generally lost due to queue losses (refer to the discussion of 
too many discards out above). 

Packets lost in layer 3 forwarding are generally lost because the destination is
unknown or unreachable. Frames are rarely lost in layer 2 forwarding.

Frames lost in receiving (In frames) can be lost for a variety of reasons:

• The receive process may not have enough buffers to hold the incoming frames.

• The router or switch may use input queueing, a technique where frames are
buffered (queued) in the receiving interface hardware. Other designs are:

  • shared memory, where a central memory is used to hold frames. In this design
  the receiving interface and sending interface access the shared memory to
  read and write the frames.

  • output queueing, where the frames are held in buffers in the sending interface.

When a router or switch uses input queueing, if an outbound link is too busy, the
input queues in the receiving interface fill up and discard frames.

Recommendations: • Increase the amount of memory (buffers) allocated to the receiving interface.

• If the router or switch uses input queueing, check the output interfaces to see
which (if any) are too busy. If any are, solve that problem.

• Check to see if there are any other errors which may be causing this interface to
discard frames.



Ethernet TOT, Bandwidth Utilization > 100% 15/60 min Minor

Message: Speed set too low

Description: The bandwidth utilization was measured at over 100%. This is most often caused by
the speed being set incorrectly in the poller configuration.

Recommendations: • Check the speed of the shared segment. For example, if it is set to 10 Mbit/sec, is the
speed really 100 Mbit/sec?

• Check if the segment is really a full duplex switch port. If so, change the element
type (to MIB2LAN Full Duplex). This probably means changing the MTF, and
may mean recertifying the device.
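
The "Speed set too low" alarm follows directly from how utilization is computed. The sketch below assumes utilization is derived from an octet-counter delta over a polling interval divided by the configured speed. The function name and the sample figures are hypothetical, and the poller's actual calculation is not described in this document.

```python
def bandwidth_utilization_pct(octets_delta, interval_sec, speed_bps):
    """Percentage of the configured link speed used during one polling interval."""
    bits = octets_delta * 8
    return 100.0 * bits / (interval_sec * speed_bps)


if __name__ == "__main__":
    # Hypothetical sample: 750,000,000 octets seen in a 300 second poll,
    # which is an average of 20 Mbit/sec on the wire.
    octets, interval = 750_000_000, 300

    # Speed wrongly configured as 10 Mbit/sec: utilization exceeds 100%.
    print(bandwidth_utilization_pct(octets, interval, 10_000_000))   # 200.0

    # Speed corrected to 100 Mbit/sec: the same traffic is a modest load.
    print(bandwidth_utilization_pct(octets, interval, 100_000_000))  # 20.0
```

A measured value above 100% is physically impossible on a correctly described link, which is why the rule points at the configured speed or element type rather than at the traffic.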



6-Jun-00 



Concord Communications, Inc. 



Live Exceptions Profiles V1.9



Table 5 Full Duplex Switch Port - Failure 

Element Type Rule, Trend Variable, Threshold Window Severity 

Ethernet Availability 30 min Critical

Message: LAN Down 

Description: How LAN availability is measured depends on the agent monitoring the LAN. 

• If the agent is built in to a hub or repeater, then if the hub or repeater is down, the 
LAN is down. Note that some of the stations on the LAN may still be able to 
communicate, but the LAN will be partitioned. 

• If the agent is a promiscuous listening station (for example, an RMON-enabled
MAC station supporting the etherStats MIB), then the state of the station
determines the state of the LAN.

• If the agent is a station that supports the dot3 MIB defined in RFC 1284, or in
RFC 1643 and RFC 1650 that replaced it, then the station state determines the LAN
state.



Recommendations: • Drill down to an AAG report for this LAN to see if any problems led up to the
failure.

Ethernet TOT, Errors (%) > 5% 15/60 min Major

Message: Too many errors 

Description: The percentage of frames sent on the Ethernet with errors is too high. 

Recommendations: Investigate 

• Alignment errors 

• Too many collisions 

• Late collisions 

• Runt (too small) frame 

• Babbling stations (stations always sending) 



Ethernet TOT, Collisions (%) > 0% 1/60 min Major

Message: Misconfigured - collisions on Full Duplex Ethernet Port

Description: Switch ports operating in full duplex mode should not experience collisions. However,
both the switch port and the station must be set up to use full duplex. If either is
misconfigured, the LAN segment will experience collisions.

Recommendations: • Check that both the station and the switch port support full duplex.

• Check that both are properly configured to be in full duplex mode.

• You may have applied the full duplex profile to the wrong element.



Ethernet, MIB2LAN Full Duplex TOT, Discards In % > 1% 15/60 min Major

Message: Received Frame discards

Description: Too many frames were discarded after they were received. 

A frame going through a router or switch gets processed by three processes, a 
receiving process (frames in), a forwarding process, and a sending process (frame out). 
The frame (packet) can be discarded (lost) in any of the three processes. 

Packets lost in sending are generally lost due to queue losses (refer to the discussion of 
too many discards out above). 

Packets lost in forwarding are generally lost because the destination is unknown or 
unreachable. 

Packets lost in receiving (In frames) can be lost for a variety of reasons:

• The receive process may not have enough buffers to hold the incoming frames.

• The router or switch may use input queueing, a technique where frames are
buffered (queued) in the receiving interface hardware. Other designs are:

  • shared memory, where a central memory is used to hold frames. In this design
  the receiving interface and sending interface access the shared memory to
  read and write the frames.

  • output queueing, where the frames are held in buffers in the sending interface.

When a router or switch uses input queueing, if an outbound link is too busy, the
input queues in the receiving interface fill up and discard frames.

Recommendations: • Increase the amount of memory (buffers) allocated to the receiving interface.

• If the router or switch uses input queueing, check the output interfaces to see
which (if any) are too busy. If any are, solve that problem.

• Check to see if there are any other errors which may be causing this interface to
discard frames.



Ethernet, MIB2LAN Full Duplex TOT, Bandwidth Utilization > 100% 15/60 min Minor

Message: Speed set too low

Description: The bandwidth utilization was measured at over 100%. This is most often caused by
the speed being set incorrectly in the poller configuration.

Recommendations: • Check the speed of the shared segment. For example, if it is set to 10 Mbit/sec, is the
speed really 100 Mbit/sec?



2.3 Ethernet - Unusual Workload Profiles 

For Ethernets, three Unusual Workload profiles are provided. They are all the same. 
Table 6 Shared Ethernet Segment - Unusual Workload,
Ethernet Half Duplex Switch Port - Unusual Workload, and 
Ethernet Full Duplex Switch Port - Unusual Workload 



Element Type Rule, Trend Variable, Threshold Window Severity

Ethernet (DFM, Broadcasts above 99.9 percentile) AND 15/60 min Warning
(TOT, Bandwidth Utilization > 10%)
Message: Unusually high broadcasts

Description: The number of broadcast frames on the LAN is unusually high. The rule is combined
with a test that Bandwidth Utilization is > 10% to filter out unusually high, but very
small, values. Refer to 12.1.

Recommendations: • High broadcasts without high unicasts may precede broadcast storms.

• ARP (the address resolution protocol used in IP networks) sends broadcast
packets to locate stations that have a particular IP address, or a router that can
forward a packet to that IP address. An unusually high number of broadcast
frames may indicate problems in ARP.

• Similarly, any other protocol that uses broadcast frames to locate other systems or
services may be having a problem finding those systems or services.

• A new application or protocol may have been added to the Extended LAN that
uses broadcast frames.



Ethernet (DFM, Multicasts above 99.9 percentile) AND 15/60 min Warning
(TOT, Bandwidth Utilization > 10%)
Message: Unusually high multicasts

Description: The number of multicast frames on the LAN is unusually high. The rule is combined
with a test that Bandwidth Utilization is > 10% to filter out unusually high, but very
small, values. See 12.1.

Multicast frames are used like broadcast frames in location protocols to find systems
or services. Like broadcast frames, they flood through the entire extended LAN.
However, because multicasts are protocol specific, only hosts participating in the
protocol receive them and must process them.

Recommendations: • High multicasts without high unicasts may precede multicast storms.

• A protocol that uses multicast frames to locate other systems or services may be
having a problem finding those systems or services.

• A new application or protocol may have been added to the Extended LAN that
uses multicast frames.



Ethernet (DFM, Unicasts above 99.9 percentile) AND 15/60 min Warning 

(TOT, Bandwidth Utilization > 10%) 
Message: Unusually high unicasts 

Description: The number of unicast frames on the LAN is unusually high. The rule is combined
with a test that Bandwidth Utilization is > 10% to filter out unusually high, but very
small, values.

Recommendations: • A new application or protocol may have been added to the Extended LAN. 



MIB2LAN Port, MIB2LAN Full Duplex (DFM, Frames In above 99 percentile) AND 15/60 min Warning
(TOT, Bandwidth Utilization In > 10%)

(DFM, Frames Out above 99 percentile) AND
(TOT, Bandwidth Utilization Out > 10%)

Message: Unusually high frames in
Unusually high frames out

Description: The number of frames on the LAN is unusually high. The rule is combined with a test
that Bandwidth Utilization is > 10% to filter out unusually high, but very small,
values.

Recommendations: • A new application or protocol may have been added to the Extended LAN. 



MIB2LAN Port, MIB2LAN Full Duplex (DFM, Non-Unicast Frames In above 99 percentile) AND 15/60 min Warning
(TOT, Bandwidth Utilization In > 10%)

(DFM, Non-Unicast Frames Out above 99 percentile) AND
(TOT, Bandwidth Utilization Out > 10%)

Message: Unusually high non-unicast frames in
Unusually high non-unicast frames out

Description: The number of non-unicast frames on the LAN is unusually high. The rule is combined
with a test that Bandwidth Utilization is > 10% to filter out unusually high, but very
small, values.

MIB2LAN elements combine broadcast and multicast frames into a category called 
non-unicast frames. As described above, broadcast and multicast frames are often used 
in protocols to locate systems and services. 



Recommendations: • High broadcasts and multicasts without high unicasts may precede broadcast
storms.

• ARP (the address resolution protocol used in IP networks) sends broadcast
packets to locate stations that have a particular IP address, or a router that can
forward a packet to that IP address. An unusually high number of broadcast
frames may indicate problems in ARP.

• Similarly, any other protocol that uses broadcast or multicast frames to locate
other systems or services may be having a problem finding those systems or
services.

• A new application or protocol may have been added to the Extended LAN that 
uses broadcast or multicast frames. 
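
The Unusual Workload rules in Table 6 all pair a percentile test with a minimum-utilization test. The sketch below shows one way such a compound rule could be evaluated. It assumes the "above N percentile" test compares the current sample with a nearest-rank percentile of historical samples for the same element. That baseline method, the function names, and the sample data are assumptions for illustration and are not taken from the product.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(values)
    # round() guards against floating-point noise before taking the ceiling.
    rank = max(1, math.ceil(round(pct * len(ordered) / 100.0, 9)))
    return ordered[rank - 1]


def unusual_workload(history, current, utilization_pct,
                     pct=99.9, min_utilization_pct=10.0):
    """True when the sample is above the percentile AND the link is busy enough.

    The utilization test filters out values that are unusual for this
    element but too small to matter.
    """
    return (current > percentile(history, pct)
            and utilization_pct > min_utilization_pct)


if __name__ == "__main__":
    # Hypothetical broadcast rates (frames/sec) from earlier polls.
    history = [10] * 999 + [500]
    print(unusual_workload(history, current=600, utilization_pct=25.0))  # True
    print(unusual_workload(history, current=600, utilization_pct=5.0))   # False
```

The second call shows the filtering effect described in the table: the broadcast rate is unusual, but the link is nearly idle, so no alarm is raised.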
3 Token Ring Profiles

Three profiles are provided for Token Ring LANs: 

• Token Ring - Delay, see Table 7. 

• Token Ring - Failure, see Table 8. 

• Token Ring - Unusual Workload, see Table 9.

Token rings can be represented by elements whose type is:

• Token Ring, which generally represents a station or hub that monitors the total traffic on the LAN.

• MIB2LAN, which generally represents a station on the LAN that monitors only the traffic it sends and
receives.

3.1 Token Ring - Delay Profile 

Table 7 Token Ring - Delay 

Element Type Rule, Trend Variable, Threshold Window Severity 

Token Ring, MIB2LAN TOT, Bandwidth Utilization > 80% 15/60 min Minor

Message: Over Utilized

Description: The bandwidth utilization is too high for a shared Token Ring LAN. High utilization
leads to queueing delays in the stations on the LAN, as each station must wait for the
token to be released by other stations before it can send a frame on the LAN, and other
waiting frames are delayed awaiting their turn.

Recommendations: • Upgrade the Token Ring to a higher speed LAN, 4 Mbps to 16 Mbps, or upgrade to
100 Mbps FDDI or 100 Mbps or 1 Gbps Ethernet.

• Reduce the number of stations on the Ring. 

• Remove traffic from the LAN. 

Token Ring TOT, Soft Errors > 0.1 errors/sec 15/60 min Minor

Message: Too many soft errors

Description: Soft errors are recoverable errors at the MAC layer of the LAN.

Recommendations: 1. Determine the specific type of error.

Run a Trend Report for this Token Ring for the days where the alarm is active,
selecting the following soft error variables:
TR Abort Errors
TR Address Copied Errors
TR Burst Errors
TR Congestion Errors
TR Frequency Errors
TR Frame Copied Errors
TR Internal Errors
TR Line Errors
TR Lost Frame Errors
TR Token Errors
Select the chart style of stacked area or stacked bar chart.

The color of the areas or bars should identify which specific type of soft errors have
occurred.

2. Correct that problem.


MIB2LAN TOT, Bandwidth Utilization In (%) > 50% 15/60 min Minor
TOT, Bandwidth Utilization Out (%) > 50%

Message: Over Utilized In
Over Utilized Out

Description: The bandwidth utilization in or the bandwidth utilization out on this interface is too
high for a Token Ring. High utilization may lead to queueing delays in the stations on
the LAN segment, as each station must wait for the token to be released by other
stations before it can send a frame on the LAN, and other waiting frames are delayed
awaiting their turn.



Recommendations: • Upgrade the Token Ring to a higher speed LAN, 4 Mbps to 16 Mbps, or upgrade to
100 Mbps FDDI or 100 Mbps or 1 Gbps Ethernet.

• Reduce the number of stations on the Ring. 

• Remove traffic from the LAN. 



MIB2LAN (DFM, Non-Unicasts > 99.9 percentile) AND 15/60 min Minor
(TOT, Non-Unicasts > 200 frames/sec)
Message: Broadcast Storm

Description: Under certain conditions, the higher layer protocols using the LAN can generate too
many broadcast frames. Broadcast and multicast frames pass through switches and
bridges. Every station on the extended LAN must handle broadcast frames, and thus
too many broadcast frames can have a significant impact on each station attached to
the extended LAN.



Recommendations: • Determine the specific protocol or protocols causing the storm. For example, run
a Traffic Accountant report on protocols for a probe attached to the extended
LAN.

• Once the protocol generating too many broadcasts is identified, determine the 
reason why so many broadcasts are being sent, and correct it. 

• Routers can be used to separate broadcast domains, so replace a switch at the top 
of the switch hierarchy with a router. 



MIB2LAN TOT, Discards Out % > 1% 15/60 min Warning

Message: Too many discards out

Description: This alarm will be raised only for Token Ring LAN elements that collect interface
statistics (MIB2LAN). When an interface queue grows, eventually the router, host, or
switch will run out of buffers to hold the queued frames, and any additional frames 
that should be sent out the interface will be discarded. Discards are normal in IP 
networks because the TCP protocol is designed to drive the bottleneck link to 
saturation. The resulting congestion is then signaled back to the TCP sender as 
discarded (lost) packets. Too many discards lower the overall network efficiency, as 
the discarded packets must be resent. 

Recommendations: • While most discards are due to queueing discards, there are other reasons a router
may discard packets. Depending on the device, see if any of these other reasons may be
causing discards.

• If the link is over utilized, deal with it as described above. Note this may only 
move the bottleneck to another link. After increasing the speed, look to see if 
other links in the path are now seeing too many discards or are now over utilized. 

• Increase the number of buffers in the output queue. This is only appropriate if the 
link is not causing delay in the network, but is still discarding packets. If the link 
is causing significant delay, adding buffers can make it worse, without decreasing 
the discard rate significantly. 

• Implement RED (Random Early Detection, also called Random Early Discard) on the
link. RED is a technique supported by many routers and switches to signal congestion
to TCP flows before the queue fills. This has proven extremely effective in lowering
discards and improving overall network performance. However, if most of the traffic
is based on UDP, or on protocols other than TCP/IP, RED may not affect it.

3.2 Token Ring - Failure Profile

Table 8 Token Ring - Failure 

Element Type Rule, Trend Variable, Threshold Window Severity 

Token Ring Availability 30 min Critical 

Message: LAN Down 

Description: How LAN availability is measured depends on the agent monitoring the LAN. 

• If the agent is built into a hub or repeater, then if the hub or repeater is down, the 
LAN is down. Note that some of the stations on the LAN may still be able to 
communicate, but the LAN will be partitioned. 

• If the agent is a promiscuous listening station (for example an RMON enabled 
MAC stations supporting the tokenringMLStats MIB), then the state of the station 
determines the state of the LAN. 

Recommendations: • Drill down to an AAG report for this LAN to see if any problems led up to the
failure.



Token Ring TOT, Hard Errors > 0.01 errors/sec 15/60 min Major

Message: Too many hard errors

Description: Hard errors are fatal errors that may be recovered from, but which often indicate a
hardware failure in the ring. Hard failures include: TR Set Recovery Mode, TR Signal
Loss, TR Bit Streaming, and TR Contention Mode errors.

Recommendations: • Drill down to an AAG report to see the history of this problem.



MIB2LAN TOT, Errors (%) > 5%

Message: Too many errors

Description: The percentage of frames sent on the LAN with errors is too high.

Recommendations: • Drill down to an AAG report to diagnose the problem.



MIB2LAN TOT, Discards In % > 1% 15/60 min Major

Message: Received Frame discards

Description: Too many frames were discarded after they were received.

A frame going through a router or switch gets processed by three processes, a
receiving process (frames in), a forwarding process, and a sending process (frames out).
The frame (packet) can be discarded (lost) in any of the three processes.

Packets lost in sending are generally lost due to queue losses (refer to the discussion of
too many discards out above).

Packets lost in forwarding are generally lost because the destination is unknown or
unreachable.

Packets lost in receiving (In frames) can be lost for a variety of reasons:

• The receive process may not have enough buffers to hold the incoming frames.

• The router or switch may use input queueing, a technique where frames are
buffered (queued) in the receiving interface hardware. Other designs are:

  • shared memory, where a central memory is used to hold frames. In this design
  the receiving interface and sending interface access the shared memory to
  read and write the frames.

  • output queueing, where the frames are held in buffers in the sending interface.

When a router or switch uses input queueing, if an outbound link is too busy, the
input queues in the receiving interface fill up and discard frames.

Recommendations: • Increase the amount of memory (buffers) allocated to the receiving interface. 

• If the router or switch uses input queueing, check the output interfaces to see 
which (if any) are too busy. If any are, solve that problem. 

• Check to see if there are any other errors which may be causing this interface to 
discard frames. 



MIB2LAN TOT, Bandwidth Utilization > 100% 15/60 min Minor

Message: Speed set too low

Description: The bandwidth utilization was measured at over 100%. This is most often caused by
the speed being set incorrectly in the poller configuration.

Recommendations: • Check the speed of the ring. For example, if it is set to 4 Mbit/sec, is the speed
really 16 Mbit/sec?

3.3 Token Ring - Unusual Workload Profile

Table 9 Token Ring - Unusual Workload 

Element Type Rule, Trend Variable, Threshold Window Severity 

Token Ring (DFM, Broadcasts above 99.9 percentile) AND 15/60 min Warning 

(TOT, Bandwidth Utilization > 10%) 
Message: Unusually high broadcasts 

Description: The number of broadcast frames on the LAN is unusually high. The rule is combined 

with a test that Bandwidth Utilization is > 10% to filter out unusually high, but very 
small, values. Refer to 12.1. 

Recommendations: • High broadcasts without high unicasts may precede broadcast storms. 

• ARP (the address resolution protocol used in IP networks) sends broadcast 
packets to locate stations that have a particular IP address, or a router that can 
forward a packet to that IP address. An unusually high number of broadcast 
frames may indicate problems in ARP. 

• Similarly, any other protocol that uses broadcast frames to locate other systems or 
services may be having a problem finding those systems or services. 

• A new application or protocol may have been added to the Extended LAN that 
uses broadcast frames. 



Token Ring (DFM, Multicasts above 99.9 percentile) AND 15/60 min Warning
(TOT, Bandwidth Utilization > 10%)
Message: Unusually high multicasts

Description: The number of multicast frames on the LAN is unusually high. The rule is combined
with a test that Bandwidth Utilization is > 10% to filter out unusually high, but very
small, values. Refer to 12.1.

Multicast frames are used like broadcast frames in location protocols to find systems 
or services. Like broadcast frames, they flood through the entire extended LAN. 
However, because multicasts are protocol specific, only hosts participating in the 
protocol receive them and must process them. 
Recommendations: • High multicasts without high unicasts may precede multicast storms. 

• A protocol that uses multicast frames to locate other systems or services may be 
having a problem finding those systems or services. 

• A new application or protocol may have been added to the Extended LAN that 
uses multicast frames. 



Token Ring (DFM, Unicasts above 99.9 percentile) AND 15/60 min Warning 

(TOT, Bandwidth Utilization > 10%) 
Message: Unusually high unicasts 

Description: The number of unicast frames on the LAN is unusually high. The rule is combined 

with a test that Bandwidth Utilization is > 10% to filter out unusually high, but very 
small, values. 

Recommendations: • A new application or protocol may have been added to the Extended LAN. 



MIB2LAN Port (DFM, Frames In above 99 percentile) AND 15/60 min Warning

(TOT, Bandwidth Utilization In > 10%) 

(DFM, Frames Out above 99 percentile) AND 
(TOT, Bandwidth Utilization Out > 10%) 
Message: Unusually high frames in 

Unusually high frames out 




Description: The number of frames on the LAN is unusually high. The rule is combined with a test 

that Bandwidth Utilization is > 10% to filter out unusually high, but very small, 

values. 

Recommendations: • A new application or protocol may have been added to the Extended LAN. 



MIB2LAN Port (DFM, Non-Unicast Frames In above 99 percentile) AND 15/60 min Warning
(TOT, Bandwidth Utilization In > 10%)

(DFM, Non-Unicast Frames Out above 99 percentile) AND
(TOT, Bandwidth Utilization Out > 10%)

Message: Unusually high non-unicast frames in 

Unusually high non-unicast frames out 
Description: The number of non-unicast frames on the LAN is unusually high. The rule is combined 

with a test that Bandwidth Utilization is > 10% to filter out unusually high, but very 

small, values. 

MIB2LAN elements combine broadcast and multicast frames into a category called 
non-unicast frames. As described above, broadcast and multicast frames are often used 
in protocols to locate systems and services. 



Recommendations: • High broadcasts and multicasts without high unicasts may precede broadcast 

storms. 

• ARP (the address resolution protocol used in IP networks) sends broadcast 
packets to locate stations that have a particular IP address, or a router that can 
forward a packet to that IP address. An unusually high number of broadcast 
frames may indicate problems in ARP. 

• Similarly, any other protocol that uses broadcast or multicast frames to locate 
other systems or services may be having a problem finding those systems or 
services. 

• A new application or protocol may have been added to the Extended LAN that 
uses broadcast or multicast frames. 

4 WAN Profiles

The WAN profiles apply to elements whose types include WAN and Server WAN. Thus this profile can be applied 
to a LAN/WAN group with the appropriate elements, or a Server group, again based on their interface speeds. 



Profiles supported include Delay profiles, Failure profiles, and Unusual Workload profiles. 

4.1 WAN - Delay Profiles 

Separate WAN Delay profiles are provided for different ranges of link speeds. The following table describes them. 



Profile        Supported Speed Links    Link Speed Range

56K profile    low speed links          links with speed < 256 Kbps.
T1 profile     moderate speed links     links with speeds from 256 Kbps to 3 Mbps.
T3 profile     high speed links         links with speed above 3 Mbps.



The main difference between the profiles is in the acceptable bandwidth utilization. Higher speed links can support a 
higher utilization than lower speed links. This is seen in the threshold used in the over utilized alarms. Otherwise, 
these profiles are the same, and all are described in Table 10. 
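
The choice of Delay profile and its over-utilization threshold can be sketched as a simple lookup on link speed, using the speed ranges above and the thresholds given in Table 10 (50%, 75%, and 90%). The function names are hypothetical. The sketch also assumes that 1 Kbps means 1000 bits/sec and that links at exactly 256 Kbps or 3 Mbps fall into the T1 profile, which the text does not state explicitly.

```python
def wan_delay_profile(speed_bps):
    """Return (profile name, over-utilization threshold in %) for a link speed."""
    if speed_bps < 256_000:
        return ("56K", 50.0)   # low speed links
    if speed_bps <= 3_000_000:
        return ("T1", 75.0)    # moderate speed links
    return ("T3", 90.0)        # high speed links


def over_utilized(speed_bps, utilization_pct):
    """True when utilization exceeds the threshold for the link's profile."""
    _, threshold = wan_delay_profile(speed_bps)
    return utilization_pct > threshold


if __name__ == "__main__":
    # The same 80% utilization is an alarm on a T1 but not on a T3.
    print(over_utilized(1_544_000, 80.0))    # True
    print(over_utilized(44_736_000, 80.0))   # False
```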

Table 10 WAN 56K - Delay, WAN T1 - Delay, and WAN T3 - Delay

Element Type Rule, Trend Variable, Threshold Window Severity 

WAN TOT, Bandwidth Utilization In > x% 15/60 min Minor
TOT, Bandwidth Utilization Out > x%

Message: Over Utilized In
Over Utilized Out

Description: The WAN link is carrying too much traffic In or Out. As traffic builds on an outbound
link, a frame that arrives to be sent on that link is queued until the link
becomes free. Since each frame must wait for the frames queued in front of it to be
serviced, longer queues add more delay to the latency of the packet.

The faster the link, the higher the utilization that can be supported. Three speed ranges
are supplied: 56K, T1, and T3.

The 56K profile supports low speed links, links with speed < 256 Kbps, x = 50%.
The T1 profile supports moderate speed links, links with speeds from 256 Kbps to 3 Mbps, x = 75%.
The T3 profile supports high speed links, links with speed above 3 Mbps, x = 90%.

Recommendations: • Get a faster circuit. For example, upgrade a 128 Kbps ISDN link to a fractional T1
at 256 Kbps.

• Set up a parallel circuit, and split the traffic equally between the two circuits.

• Reroute traffic. If you have a mesh network with redundant paths, you may be
able to change the routing to direct some of the traffic to follow an alternate path.

• Add a direct circuit to divert traffic off this circuit. For example, if the Los
Angeles to Chicago circuit is too busy, and a large fraction of the traffic on the
circuit is destined for Atlanta, add a direct circuit from Los Angeles to Atlanta to
offload that traffic.

• Prioritize the traffic carried over the circuit, and use traffic shaping and policing to
ensure high priority traffic gets through with minimal delay, at the cost of
delaying the low priority traffic (or even discarding it).

Entire books have been written on network design and redesign. To dig deeper, start
with Designing Wide Area Networks and Internetworks: A Practical Perspective (see footnote 3).



WAN TOT, Discards Out % > 1% 15/60 min Warning

Message: Too many discards out 

Description: When a queue grows, eventually the router, host, or switch will run out of buffers to 

hold the queued frames, and any additional frames that should be sent out the interface 
will be discarded. Discards are normal in IP networks because the TCP protocol is 
designed to drive the bottleneck link to saturation. The resulting congestion is then 
signaled back to the TCP sender as discarded (lost) packets. Too many discards lower 
the overall network efficiency, as the discarded packets must be resent. 



3 Marcus, J. Scott, Designing Wide Area Networks and Internetworks: A Practical Perspective, Addison-Wesley, 
Reading, Mass, 1999, ISBN 0-201-69584-7. 

Recommendations: • While most discards are due to queueing discards, there are other reasons a router
may discard packets. Depending on the device, see if any of these other reasons
may be causing discards.

• If the link is over utilized, deal with it as described in the discussion of the Over 
Utilized alarms above. Note this may only move the bottleneck to another link. 
After increasing the speed, look to see if other links in the path are now seeing too 
many discards or are now over utilized. 

• Increase the number of buffers in the output queue. This is only appropriate if the 
link is not causing delay in the network, but is still discarding packets. If the link 
is causing significant delay, adding buffers can make it worse, without decreasing 
the discard rate significantly. 

• Implement RED (Random Early Detection, also called Random Early Discard) on the link. RED is a technique 
supported by many routers and switches to signal congestion to TCP flows before 
the queue fills. This has proven extremely effective in lowering discards and 
improving overall network performance. However, if most of the traffic is based 
on UDP, or on protocols outside the TCP/IP suite, RED may have little effect on it. 
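
The idea behind RED can be sketched in a few lines. This is a minimal illustration only, not any vendor's implementation: the function names, the thresholds (min_th, max_th), the maximum drop probability (max_p), and the averaging weight are all assumptions chosen for the example. It shows the basic linear drop curve applied to a smoothed queue length; real implementations add further refinements.

```python
def update_average(avg_queue, current_queue, weight=0.002):
    """Smooth the queue length with an exponentially weighted moving average.

    The weight is an assumed example value, not a recommended setting.
    """
    return (1.0 - weight) * avg_queue + weight * current_queue


def red_drop_probability(avg_queue, min_th, max_th, max_p):
    """Probability of dropping an arriving frame, given the average queue length.

    Below min_th nothing is dropped; at or above max_th everything is dropped;
    in between, the probability rises linearly from 0 to max_p.
    """
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)


if __name__ == "__main__":
    for avg in (5, 20, 30):
        print(avg, red_drop_probability(avg, min_th=10, max_th=30, max_p=0.1))
```

Because frames are dropped with a small probability before the queue is full, TCP senders slow down gradually instead of all losing frames at once when the buffers run out.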
4.2 WAN - Failure Profile 

A single WAN failure profile is provided that applies to all WAN elements. It detects outright failures and too many 
errors; see Table 11. 

Table 11 WAN - Failure 

Element Type Rule, Trend Variable, Threshold Window Severity 

WAN Availability 30 min Critical 

Message: Link Down 

Description: The link has gone down. 

Recommendations: • Correct the problem. 

• If the link is normally down, you could disable polling to stop eHealth from 
polling and alarming on this link. 

WAN TOT, Discards In % > 1% 15/60 min Major 

Message: Too many discards in 

Description: Too many frames were discarded after they were received. 

A frame going through a router or switch is handled by three processes: a 
receiving process (frames in), a forwarding process, and a sending process (frames out). 
The frame (packet) can be discarded (lost) in any of the three processes. 

Packets lost in sending are generally lost due to queue losses (refer to the discussion of 
too many discards out in Table 10). 

Packets lost in forwarding are generally lost because the destination is unknown or 
unreachable. 

Packets lost in receiving (In frames) can be lost for a variety of reasons. 

• The receive process may not have enough buffers to hold the incoming frames. 

• The router or switch may use input queueing, a technique where frames are 
buffered (queued) in the receiving interface hardware. Other designs are: 

  • shared memory, where a central memory is used to hold frames. In this design 
  the receiving interface and sending interface access the shared memory to 
  read and write the frames. 

  • output queueing, where the frames are held in buffers in the sending interface. 

When a router or switch uses input queueing, if an outbound link is too busy, the 
input queues in the receiving interface fill up and discard frames. 



Recommendations: • Increase the amount of memory (buffers) allocated to the receiving interface. 

• If the router or switch uses input queueing, check the output interfaces to see 
which (if any) are too busy. If any are, solve that problem. 

• Check to see if there are any error conditions which may be causing this interface 
to discard frames. 



WAN TOT, Errors % > 1% 15/60 min Major 

Message: Too many errors 

Description: Any frame which cannot be received or sent due to an error is counted here. If too 
many frames have errors (measured as a percentage of total frames), the system 
performance will be degraded. Further, many errors are indicators of problems that 
may lead to failure of the link, interface, or router. 

Recommendations: The kinds of errors determine what may be wrong. eHealth groups all errors together. 
4.3 WAN - Unusual Workload Profile 

A single WAN unusual workload profile applies to all WAN elements. It detects when the workload on a link 
changes significantly. 

This profile works best when applied to WAN links used by many users. 



Table 12 WAN - Unusual Workload 

Element Type Rule, Trend Variable, Threshold Window Severity 

WAN • (DFM, Frames Out above 99.9 percentile) 15/60 min Warning 
AND (PFM, Frames Out 10% above mean) 

• (DFM, Frames In above 99.9 percentile) AND 
(PFM, Frames In 10% above mean) 

Message: • Unusually High Frames Out 

• Unusually High Frames In 

Description: The traffic, as measured by the number of Frames In or Out, is unusually high. The rule 
requires at least a 10% increase in the number of frames/sec. Refer to section 12.1. 

Recommendations: • Drill down to a Trend report to see how the current data compares to the normal 
range. 

• Drill down to an AAG report to diagnose the current values of a number of key 
variables for this WAN link. 

• If the Utilization In or Out is high, the WAN link may be causing delay; refer to 
the discussion in Table 10 for recommendations. 

• If the number of frames is Unusually High In and Out, and the Average Frame 
Size is small, the WAN link may be carrying an unusually high number of control 
frames. This may indicate a protocol problem. The Average Frame Size is a trend 
variable. 

• A new application or a new group of users may now be using this link. In these 
cases, the alarm should remain active for a long time. 

WAN • (DFM, Frames Out below 99.9 percentile) 15/60 min Warning 
AND (PFM, Frames Out 10% below mean) 

• (DFM, Frames In below 99.9 percentile) AND 
(PFM, Frames In 10% below mean) 

Message: • Unusually Low Frames Out 

• Unusually Low Frames In 

Description: The traffic, as measured by the number of Frames In or Out, is unusually low. The rule 
requires at least a 10% decrease in the number of frames/sec. Refer to section 12.1. 

Recommendations: • Drill down to a Trend report to see how the current data compares to the normal 
range. 

• Drill down to an AAG report to diagnose the current values of a number of key 
variables for this WAN link. 

• If the traffic on the link is low during a period when the traffic is expected to be 
very predictable, this may indicate a problem with an application. For example, 
every night at midnight, a file is transferred from a branch office to headquarters. 
Unusually low frames on the WAN link between the branch office and 
headquarters may indicate the file transfer failed. 

• If the alarm remains active for a long time, it could mean an application or a 
group of users is no longer using this link. 
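
The shape of the "unusually high" rule can be illustrated with a small sketch. It is not eHealth's actual algorithm: it assumes that the DFM condition compares the current value with the 99.9th percentile of a historical baseline and that the PFM condition requires the value to be at least 10% above the baseline mean. The function names are hypothetical, and the 15/60 min window is not modeled.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    rank = min(len(ordered), max(1, rank))
    return ordered[rank - 1]


def unusually_high(current, baseline, pct=99.9, min_increase=0.10):
    """True when both assumed conditions of the rule hold for the current value."""
    above_percentile = current > percentile(baseline, pct)
    above_mean = current >= mean(baseline) * (1.0 + min_increase)
    return above_percentile and above_mean


if __name__ == "__main__":
    frames_per_sec = [100, 102, 98, 101, 99]  # made-up baseline samples
    for value in (101, 105, 115):
        print(value, unusually_high(value, frames_per_sec))
```

Requiring both conditions keeps a very stable link from alarming on a tiny change: a value can exceed every historical sample and still be ignored if it is less than 10% above the mean.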
5 Frame Relay Profiles 

Frame Relay Ports are WAN Links and should be monitored with the appropriate WAN profiles. These profiles 
apply to Frame Relay circuits (DLCIs). Frame Relay circuits should have their speed set to Committed Information 
Rate (CIR). 

Frame Relay circuits are used by enterprises to send data over long distances (the WAN). They are purchased 
from a Frame Relay service provider, an organization that builds a Frame Relay network and sells bandwidth in the 
form of Frame Relay circuits to enterprises. Profiles are provided for the enterprise customer of Frame Relay 
services and for the service provider. 

An enterprise customer of a Frame Relay circuit purchases a physical WAN serial link (often a T1 link) at each 
router to connect them to the Frame Relay network. These are called access links. The customer and the service 
provider set up a Frame Relay circuit over the two access links. A Frame Relay circuit is identified at each end of a 
link by a DLCI, a Data Link Connection Identifier. Each access link can carry multiple Frame Relay circuits. It is very 
common for a customer of a Frame Relay service to buy a single link to attach a central router or switch located at 
corporate headquarters to a Frame Relay network. The headquarters access link carries many Frame Relay circuits, 
each leading to a different remote office. 

The service provider provisions the Frame Relay network to connect the two ends of each circuit together. Frame 
Relay circuits often are provisioned through multiple Frame Relay switches and are carried across multiple trunks 
within the Frame Relay network. The Frame Relay service provider lowers cost by selling the bandwidth of the 
trunks to carry circuits for many customers. Indeed, the Frame Relay service provider may over subscribe the trunk 
bandwidth. 

Frame Relay circuits have a CIR that is less than or equal to the speed of the underlying link to access the Frame 
Relay network. The CIR is measured in bits per second, and represents the largest average rate at which data is 
guaranteed to be delivered over the circuit. A user of a Frame Relay circuit can send traffic at a rate faster than CIR; 
however, the Frame Relay service provider does not guarantee the delivery of that portion of the data "over CIR". 
Since the cost of a Frame Relay circuit depends on the CIR purchased, some users feel data sent over CIR is "free" 
bandwidth. 

When there is too much traffic through a switch or over a trunk within the Frame Relay network, the trunk or switch 
can become congested. The point at which a trunk or switch is considered congested depends on the policies of the Frame Relay service 
provider and on the capabilities of the underlying switch. Switches built by different vendors have different 
policies and techniques for identifying and responding to congestion. Customers of a Frame Relay service will have 
to contact their service provider for a precise definition of congestion. Because many circuits share trunks and 
switches, congestion on them can affect all customers, not just the circuits contributing the traffic that causes the 
congestion. Frame Relay circuits are bi-directional (data can be sent in both directions), and the congestion may 
affect traffic sent in only one direction. 

When a Frame Relay network is congested, it may send congestion notifications to the sender and receiver of the 
traffic. The congestion notification sent back to the sender is called a Backward Explicit Congestion Notification, or 
BECN. The congestion notification sent with the congested traffic to the receiver is called a Forward Explicit 
Congestion Notification, or FECN. These terms are defined from the point of view of the Frame Relay switches 
inside the Frame Relay network. For the customer of a Frame Relay service, receiving a BECN on a Frame Relay 
circuit indicates that the data that was sent over the circuit encountered congestion. The customer should respond by 
sending less traffic (see footnote 4). Receiving a FECN on a Frame Relay circuit indicates that the data that was received over the 
circuit encountered congestion. The customer should respond by having the far end of the circuit send less traffic. 



4 At least that's what the Frame Relay specifications and the network service provider would like the user to do. But 
few routers or switches change their behavior in response to FECNs or BECNs. 
One way a Frame Relay network can respond to congestion is to discard frames. One way to control which frames 
are discarded is to discard frames with the Discard Eligible (DE) flag set. The discard eligible flag can be set in two 
ways. The customer of the Frame Relay service could flag certain frames as discard eligible. The customer could 
flag frames over CIR as discard eligible, or the customer could prioritize traffic based on type, and flag lower 
priority traffic as discard eligible. Frame Relay service providers can also flag frames as being discard eligible. Most 
often, this is done by the first Frame Relay switch inside the Frame Relay network receiving the traffic, that is, the Frame 
Relay access switch where the access link from the customer terminates. If this access switch receives data at a rate 
over CIR from the customer, it can respond in a number of ways: 

1. It can simply let the frames into the network and do nothing. 

2. It can flag some of the frames as discard eligible. This is known as traffic marking. 

3. It can buffer (delay) frames, lowering the rate they are introduced into the network to lower the data rate to CIR. 
This is known as traffic shaping. 

4. It can simply decide to discard enough frames to lower the data rate to CIR. This is known as traffic policing. 

Traffic marking, shaping, or policing can be applied anywhere within the Frame Relay network. 
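
The four responses listed above can be compared with a small sketch. It is a deliberate simplification, not a description of any switch: it looks at a single measurement interval and treats everything beyond CIR times the interval as excess, whereas real switches typically work with committed and excess burst sizes. The function name, the action names, and the result keys are all hypothetical.

```python
def handle_over_cir(offered_bits, cir_bps, interval_s, action):
    """Split the traffic offered in one interval according to the chosen action.

    action is one of "pass", "mark", "shape", or "police" (assumed names).
    Returns bit counts; "forwarded" includes frames forwarded with DE set.
    """
    committed_bits = cir_bps * interval_s
    excess = max(0, offered_bits - committed_bits)
    within = offered_bits - excess
    result = {"forwarded": within, "marked_de": 0, "delayed": 0, "discarded": 0}
    if action == "pass":        # 1. let everything in, do nothing
        result["forwarded"] = offered_bits
    elif action == "mark":      # 2. traffic marking: excess is flagged discard eligible
        result["forwarded"] = offered_bits
        result["marked_de"] = excess
    elif action == "shape":     # 3. traffic shaping: excess is buffered for later
        result["delayed"] = excess
    elif action == "police":    # 4. traffic policing: excess is dropped
        result["discarded"] = excess
    else:
        raise ValueError("unknown action: " + action)
    return result


if __name__ == "__main__":
    for act in ("pass", "mark", "shape", "police"):
        print(act, handle_over_cir(140_000, 128_000, 1, act))
```

In all four cases the traffic within CIR is forwarded unchanged; the actions differ only in what happens to the excess.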

5.1 Frame Relay for the Enterprise Profiles 

Two profiles are provided for Frame Relay customers: 

• Frame Relay for the Enterprise - Delay, see Table 13. 

• Frame Relay for the Enterprise - Failure, see Table 14. 

Table 13 Frame Relay for the Enterprise - Delay 

Element Type Rule, Trend Variable, Threshold Window Severity 

Frame Relay TOT, Bandwidth Utilization In > 150% 15/60 min Warning 
TOT, Bandwidth Utilization Out > 150% 

Message: Over CIR in 
Over CIR out 

Description: This alarm indicates that the circuit is sending (out) or receiving (in) traffic 
significantly over CIR. 

Recommendations: While traffic over CIR is not itself a problem, it may lead to increased delay in the 
Frame Relay network due to congestion within the Frame Relay network or because 
the Frame Relay network loses data. This alarm warns that data is at risk. The steps 
you can take to lower the bandwidth utilization are much the same as for any WAN 
link: 

• Increase the CIR of the circuit. 

• Reroute traffic. If you have a mesh network with redundant paths, you may be 
able to change the routing to direct some of the traffic to follow an alternate path. 

• Add a direct circuit to divert traffic off this circuit. For example, if the Los 
Angeles to Chicago circuit is too busy, and a large fraction of the traffic on the 
circuit is destined for Atlanta, add a direct circuit from Los Angeles to Atlanta to 
offload that traffic. 

• Prioritize the traffic carried over the circuit, and use traffic shaping and policing to 
ensure high priority traffic gets through with minimal delay, at the cost of 
delaying the low priority traffic (or even discarding it). 

Entire books have been written on network design and redesign. To dig deeper, start 
with Scott Marcus' book "Designing Wide Area Networks and Internetworks". 

Frame Relay TOT, BECN In % > 2% 15/60 min Minor 

Message: Congestion in network on outbound data sent 

Description: The Frame Relay network is indicating that it is congested on the data being sent by 
this host over the circuit to the other end. 

Recommendations: • If this alarm persists without one of the two alarms listed below (Congestion in 
network on outbound data sent under CIR or Congestion in network on outbound data sent over CIR) 
being raised, then the traffic is encountering congestion while it is sustaining 
traffic loads near CIR. Refer to section 5.2 for a description of recommended 
actions to take. 

Frame Relay TOT, BECN In % > 2% AND 15/60 min Minor 
TOT, Bandwidth Utilization Out > 150% 

Message: Congestion in network on outbound data sent over CIR 

Description: The Frame Relay network is indicating that it is congested on the data being sent by 
this host over the circuit to the other end, and when it is congested, the circuit is 
carrying traffic significantly above CIR. 

Recommendations: • Refer to the general discussion of Frame Relay congestion in section 5.2 for a 
description of recommended actions to take. 



Frame Relay TOT, BECN In % > 2% AND 15/60 min Minor 
TOT, Bandwidth Utilization Out < 50% 

Message: Congestion in network on outbound data sent under CIR 

Description: The Frame Relay network is indicating that it is congested on the data being sent by 
this host over the circuit to the other end, and when it is congested, the circuit is 
carrying traffic significantly below CIR. 

Recommendations: • Refer to the general discussion of Frame Relay congestion in section 5.2 for a 
description of recommended actions to take. 

Frame Relay TOT, FECN In % > 2% 15/60 min Minor 

Message: Congestion in network on inbound data received 

Description: The Frame Relay network is signaling that the traffic received on the circuit 
encountered congestion as it passed through the network from the sender. 

Recommendations: • If this alarm persists without one of the two alarms listed below (Congestion in 
network on inbound data received under CIR or over CIR) being raised, then the traffic is 
encountering congestion while it is sustaining traffic loads near CIR. Refer to the 
discussion in section 5.2 for a description of recommended actions to take. 



Frame Relay TOT, FECN In % > 2% AND 15/60 min Minor 
TOT, Bandwidth Utilization In > 150% 

Message: Congestion in network on inbound data received over CIR 

Description: The Frame Relay network is signaling that the traffic received on the circuit 
encountered congestion as it passed through the network from the sender, and when 
the congestion was seen, the circuit was receiving traffic significantly above CIR. 

Recommendations: • Refer to the general discussion of Frame Relay congestion in section 5.2 for a 
description of recommended actions to take. 



Frame Relay TOT, FECN In % > 2% AND 15/60 min Minor 
TOT, Bandwidth Utilization In < 50% 

Message: Congestion in network on inbound data received under CIR 

Description: The Frame Relay network is signaling that the traffic received on the circuit 
encountered congestion as it passed through the network from the sender, and when 
the congestion was seen, the traffic was significantly below CIR. 

Recommendations: • Refer to the general discussion of Frame Relay congestion in section 5.2 for a 
description of recommended actions to take. 
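
How the congestion rules in Table 13 relate to one another can be sketched as a small function. It is an illustration only: the function name and arguments are hypothetical, the thresholds (2%, 150%, 50%) are taken from the table, and the 15/60 min time-over-threshold window is not modeled.

```python
def congestion_alarms(direction, ecn_pct, utilization_pct):
    """Return the Table 13 congestion messages that would apply.

    direction "out" pairs BECN In % with Bandwidth Utilization Out;
    direction "in" pairs FECN In % with Bandwidth Utilization In.
    """
    if direction == "out":
        subject = "outbound data sent"
    elif direction == "in":
        subject = "inbound data received"
    else:
        raise ValueError("direction must be 'in' or 'out'")

    alarms = []
    if ecn_pct > 2.0:
        base = "Congestion in network on " + subject
        alarms.append(base)
        if utilization_pct > 150.0:
            alarms.append(base + " over CIR")
        elif utilization_pct < 50.0:
            alarms.append(base + " under CIR")
    return alarms


if __name__ == "__main__":
    print(congestion_alarms("out", ecn_pct=3.5, utilization_pct=110.0))
    print(congestion_alarms("in", ecn_pct=3.5, utilization_pct=30.0))
```

When only the base message is returned, the circuit falls into the "near CIR" case described in the recommendations.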
Table 14 Frame Relay for the Enterprise - Failure 

Element Type Rule, Trend Variable, Threshold Window Severity 

Frame Relay Availability 30 min Critical 

Message: Frame Relay Circuit Down 

Description: The Frame Relay circuit is down. 

Recommendations: • Check if the underlying WAN port access link is down. 

• Check to see if the far end router or WAN port access link is down. 

• Check to see if the Frame Relay circuit has been administratively turned off at 
either end. 

• If none of the above are true, then check with the Frame Relay service provider to 
see if there is a problem within their network. 

Frame Relay TOT, Errors % > 0.5% 15/60 min Minor 

Message: Too many errors 

Description: The Frame Relay circuit has encountered errors. Most errors occur when a frame is 
being sent or received over the Frame Relay circuit. Errors when a frame is sent often 
occur because of problems within the sending interface. Errors when a frame is 
received could indicate problems in the receiving interface, or it could represent CRC 
errors where the frame is corrupted within the Frame Relay network. 

Recommendations: • Determine the kinds of errors the circuit is experiencing and correct them. 



5.2 Diagnosing Congestion Problems in Frame Relay Circuits 

Congestion problems in Frame Relay circuits can be difficult to diagnose. When an alarm indicates that a circuit is 
congested, there are a number of things to check: 

1. Which direction is encountering congestion? The profile distinguishes between congestion encountered in each 
direction. 

2. What traffic (bandwidth utilization) is the circuit carrying when the congestion is encountered? The profile 
identifies three cases: traffic is over CIR when congestion occurs, traffic is under CIR when congestion occurs, 
or neither is true, meaning traffic roughly equals CIR when congestion occurs. 

3. Where is the congestion occurring? At the sending access link, at the receiving access link, or internally within 
the network. 

a) You can identify if an access link is over utilized by examining the bandwidth utilization in and the 
bandwidth utilization out of the WAN port. 

For example, say a circuit carrying traffic from Atlanta to Boston is showing alarms at each end. You see 
Congestion in network on outbound data sent at the Atlanta end of the circuit, and Congestion in network on 
inbound data received at the Boston end of the circuit. These alarms are consistent, and indicate congestion in the 
traffic sent from Atlanta to Boston. 

The circuit has a CIR of 128 Kbit/sec, and the access lines at both ends are T1 (1.544 Mbit/sec) links. While the 
Atlanta access link carries only the single circuit, the Boston access link carries 10 circuits. 

The Atlanta-Boston circuit is carrying about 140 Kbit/sec of traffic from Atlanta to Boston. The Bandwidth 
Utilization Out measured at Atlanta is only 9% of the port's speed. Clearly, the access port out of Atlanta is not the 
source of the congestion. 

However, the circuit has a Bandwidth Utilization of about 110%, which means the circuit is being given traffic 
slightly above CIR. This could be why the network is indicating congestion. 

The Bandwidth Utilization In on the Frame Relay circuit at Boston is also about 110% (140 Kbit/sec). So we aren't 
losing much of the traffic. 
But the T1 port in Boston for the access link carries about 1.250 Mbit/sec of inbound 
traffic for 10 Frame Relay circuits, and its Bandwidth Utilization In is about 81%. This link may well be the source 
of the congestion. The traffic may be getting through the Frame Relay network, only to encounter congestion at the 
outbound access link. The three Delay - WAN profiles for 56K, T1, and T3 will raise an alarm when the access port 
is over utilized, and will distinguish between inbound and outbound directions. 
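
The percentages in this example follow directly from the rates given. The short sketch below reproduces the arithmetic; the helper name is hypothetical, and rates are expressed in Kbit/sec with a T1 taken as 1544 Kbit/sec.

```python
def utilization_pct(rate_kbps, capacity_kbps):
    """Traffic rate as a percentage of a capacity (port speed or CIR)."""
    return 100.0 * rate_kbps / capacity_kbps


T1_KBPS = 1544   # 1.544 Mbit/sec access link
CIR_KBPS = 128   # committed information rate of the Atlanta-Boston circuit

circuit_vs_cir = utilization_pct(140, CIR_KBPS)   # circuit load relative to CIR
atlanta_port = utilization_pct(140, T1_KBPS)      # Atlanta access port, outbound
boston_port = utilization_pct(1250, T1_KBPS)      # Boston access port, inbound

if __name__ == "__main__":
    print(round(circuit_vs_cir), round(atlanta_port), round(boston_port))
```

Rounded, these come to 109% (the "about 110%" in the text), 9%, and 81%, which is why attention shifts from the Atlanta port to the Boston port.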

With answers to those three questions, here are recommended actions for many of the cases. 

1. If either access link is over utilized, then fix the over utilization as described in section 4.1 and Table 10. 

2. If neither access link is over utilized, and the traffic on the Frame Relay circuit is significantly less than CIR, 
then the congestion may be within the network, and not directly caused by your traffic. 

a) Check with your network service provider to see if there is a congestion problem within the network. 

3. If the traffic is significantly greater than CIR, then the congestion may be specific to this circuit. 

a) Increase the CIR of this circuit. A Trend report of bandwidth utilization should show how much bandwidth 
(CIR) is needed. 

b) Reroute traffic off this circuit. If you have a mesh network with redundant paths, you may be able to 
change the routing to direct some of the traffic to follow an alternate path. 

c) Add a direct circuit to divert traffic off this circuit. For example, if the Los Angeles to Chicago circuit is 
too busy, and a large fraction of the traffic on the circuit is destined for Atlanta, add a direct circuit from 
Los Angeles to Atlanta to offload that traffic. 

d) If you have installed a probe at either end of the circuit, use Traffic Accountant reports to determine which 
applications and nodes are using the circuit most. 

e) Prioritize the traffic carried over the circuit, and use traffic shaping or policing to ensure high priority 
traffic gets through with minimal delay, at the cost of delaying the low priority traffic (or even discarding 
it). 

f) Add another circuit to carry the different traffic flows. 

4. If the traffic roughly equals CIR (that is, neither case 2 nor case 3 is true), then any of the above actions may 
help. 
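
The four cases above amount to a short decision procedure, sketched below. The function name is hypothetical, and the 50% and 150% cut-offs for "significantly" below or above CIR are an assumption borrowed from the thresholds in Table 13; the text itself does not fix them.

```python
def diagnose_congestion(access_link_over_utilized, circuit_util_pct,
                        low_pct=50.0, high_pct=150.0):
    """Map the answers to the three diagnostic questions onto cases 1 to 4.

    circuit_util_pct is the circuit's bandwidth utilization relative to CIR.
    """
    if access_link_over_utilized:
        return "case 1: fix the over utilized access link"
    if circuit_util_pct < low_pct:
        return "case 2: check with the service provider"
    if circuit_util_pct > high_pct:
        return "case 3: relieve this circuit (CIR, rerouting, prioritization)"
    return "case 4: traffic near CIR, any of the above may help"


if __name__ == "__main__":
    # Boston access link from the example: the port is the likely bottleneck.
    print(diagnose_congestion(True, 110.0))
    print(diagnose_congestion(False, 110.0))
```

Checking the access links first matters because an over utilized port affects every circuit it carries, regardless of how each circuit compares with its own CIR.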

5.3 Frame Relay for the Service Provider Profiles 

The Frame Relay service provider must balance the requirements of many customers and ensure that all customers 
receive the service levels for which they contract. While the service provider has more tools at its disposal to 
measure and control delay and failures, the errors, discards, and latency introduced by each Frame Relay switch 
accumulate for all the switches through which a circuit passes. The service provider profiles account for this by 
using slightly lower thresholds than the enterprise profiles. 

The two service provider profiles are: 

• Frame Relay for the Service Provider - Delay, see Table 15. 

• Frame Relay for the Service Provider - Failure, see Table 16. 
Table 15 Frame Relay for the Service Provider - Delay 

Element Type Rule, Trend Variable, Threshold Window Severity 

Frame Relay TOT, Bandwidth Utilization In > 150% 15/60 min Minor 
TOT, Bandwidth Utilization Out > 150% 

Message: Over utilized in 
Over utilized out 

Description: The traffic on this circuit is well over Committed Information Rate (CIR), either 
inbound or outbound to the Frame Relay switch. 

If this circuit is an access circuit, then 

• Over Utilized In indicates the traffic received from the customer is over CIR, 
while 

• Over Utilized Out indicates the traffic sent to the customer is over CIR. 

Recommendations: • If the traffic is consistently over CIR, the customer may wish to increase the CIR 
of the circuit. 

• For Over Utilized In on an access circuit, consider applying traffic policing or 
traffic shaping to control the overload. 



Frame Relay TOT, BECN In % > 2% 15/60 min Minor 

Message: Backward congestion received from downstream 

Description: The switch has received backward congestion indications (BECNs) from the 
downstream switch. These BECNs will be sent back upstream to the next switch closer 
to the sender. 

Recommendations: • If this is an internal trunk, one of the switches downstream (towards the receiver 
of the data) is congested. 

• If this is an NNI (Network-to-Network Interface) connection to another Frame Relay 
network, then the congestion is in the network on the other side of the NNI. 
Forward the problem to the other network provider for resolution. 



Frame Relay TOT, BECN Out % > 2% 15/60 min Minor 

Message: Backward congestion sent upstream to sender 

Description: The switch has sent BECNs, backward congestion indications, upstream on this 
circuit, towards the sender of the data that is congested. The BECNs sent combine both 
BECNs received from downstream and any congestion indications generated within 
this switch. 

Recommendations: • Determine if the BECNs are internally generated or simply passed on by this 
switch. 

• Examine the BECNs received on the circuit this circuit is cross connected to 
within the switch. 

• If they are comparable to the BECNs Out on this circuit, then trace the circuit 
downstream to find out where the congestion is occurring. 

• If the BECNs Out on this circuit are more than the BECNs received on the cross-connect 
circuit, then there is congestion within this switch. The rules for 
determining when a circuit is congested vary among switch manufacturers. 

• Examine the outbound utilization of the port carrying the circuit to which this 
circuit is cross-connected. If the outbound utilization of that port is high, then the 
queues will grow, and all the traffic carried on that network interface will be 
delayed and congested. The port should be showing alarms. Most of the circuits 
carried over this port should show alarms as well. 
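
The comparison between BECNs sent and BECNs received on the cross-connected circuit can be written down as a simple check. This is a sketch under stated assumptions: the function name is hypothetical, the counts are assumed to cover the same time period, and the 10% tolerance used to decide what counts as "comparable" is an arbitrary example value.

```python
def becn_source(becn_out, becn_in_cross_connect, tolerance=0.10):
    """Classify where the BECNs sent upstream on a circuit come from.

    becn_out: BECNs sent upstream on this circuit.
    becn_in_cross_connect: BECNs received on the cross-connected circuit.
    """
    if becn_out <= 0:
        return "none"
    if becn_out <= becn_in_cross_connect * (1.0 + tolerance):
        # Counts are comparable: the switch is relaying congestion
        # signaled from further downstream.
        return "passed through"
    # Noticeably more BECNs leave than arrive: some originate here.
    return "generated locally"


if __name__ == "__main__":
    print(becn_source(100, 98))
    print(becn_source(300, 100))
```

The same comparison applies to FECNs, with the direction of tracing reversed (upstream instead of downstream).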



Frame Relay TOT, FECN In % > 2% 15/60 min Minor 

Message: Forward congestion received from upstream 

Description: The switch has received congestion indications (FECNs) from the upstream switch. 
This indicates that the traffic was congested upstream (closer to the sender). These 
FECNs will be passed on downstream. 

Recommendations: • If this network interface is an NNI, an interface to another network, the congestion 
was within that other network. Forward the problem to the other network for 
resolution. 

• If the interface is an internal trunk, then trace the circuit back upstream to find the 
source of the congestion. 



Frame Relay TOT, FECN Out % > 2% 15/60 min Minor 

Message: Forward congestion sent downstream to receiver 

Description: The switch has sent FECNs downstream to the receiver of the data indicating 
congestion. The FECNs sent combine the FECNs received from upstream with those 
locally generated within the switch. 

Recommendations: • Determine if the FECNs are internally generated or simply passed on by this 
switch. 

• Examine the FECNs received on the circuit this circuit is cross connected to 
within the switch. 

• If they are comparable to the FECNs Out on this circuit, then trace the circuit 
upstream to the next switch to find out where the congestion is occurring. 

• If the FECNs Out on this circuit are more than the FECNs received on the cross-connect 
circuit, then the congestion is within this switch. The rules for 
determining when a circuit is congested vary among switch manufacturers. 

• Examine the outbound utilization of the port carrying this circuit. If the outbound 
utilization of that port is high, then the queues will grow, and all the traffic carried 
on that network interface will be delayed and congested. The port should be 
showing alarms. Most of the circuits carried over this port should show 
alarms as well. 



Frame Relay TOT, Discards % > 1% 15/60 min Minor 

Message: Too many discards 

Description: When a queue grows, eventually the Frame Relay switch will run out of buffers to 
hold the queued frames, and any additional frames that should be sent out the interface 
will be discarded. Discards are normal in IP networks because the TCP protocol is 
designed to drive the bottleneck link to saturation. The resulting congestion is then 
signaled back to the TCP sender as discarded (lost) packets. Too many discards lower 
the overall network efficiency, as the discarded packets must be resent. 

Recommendations: While most discards are queueing discards, there are other reasons a Frame 
Relay switch may discard packets. Depending on the switch, see if any of these other 
reasons may be causing discards: 

• If the link is over utilized, deal with it as described above. Note this may only 
move the bottleneck to another link. After increasing the speed, look to see if 
other links in the path are now seeing too many discards or are now over utilized. 

• Increase the number of buffers in the output queue. This is only appropriate if the 
link is not causing delay in the network, but is still discarding packets. If the link 
is causing significant delay, adding buffers can make it worse, without decreasing 
the discard rate significantly. 
Table 16 Frame Relay for the Service Provider - Failure 

Element Type Rule, Trend Variable, Threshold Window Severity 

Frame Relay Availability 30 min Critical 

Message: Frame Relay Circuit Down 

Description: The Frame Relay circuit is down. 

Recommendations: • Check to see if the WAN port carrying the Frame Relay circuit is down. 

• Check to see if the adjacent switch or the WAN port on the other end of the link is 
down. 

• Tracing the circuit, see if any other switch or link is down. 

• Check to see if either customer router or LAN switch is down, or if either access 
link is down. 

• If the circuit will be down for an extended period, you could temporarily turn off 
polling for the circuit. 

Frame Relay TOT, Errors % > 0.5% 15/60 min Minor 

Message: Too many errors 

Description: The Frame Relay circuit has encountered errors. Most errors occur when a frame is 
being sent or received over the Frame Relay circuit. Errors when a frame is sent often 
occur because of problems within the sending interface. Errors when a frame is 
received could indicate problems in the receiving interface, or it could represent CRC 
errors where the frame is corrupted on the link. 

Recommendations: • Determine the kinds of errors the circuit is experiencing, and correct them. 



5.4 Frame Relay - Unusual Workload Profiles 



Table 17 Frame Relay - Unusual Workload

Element Type Rule, Trend Variable, Threshold Window Severity

Frame Relay DFM, Frames Out (/sec) above 99.9 percentile 15/60 min Warning

DFM, Frames In (/sec) above 99.9 percentile

Message: Unusually High Frames Out

Unusually High Frames In

Description: The traffic as measured by the number of Frames In or Out is unusually high.

Recommendations: • Drilldown to a Trend report to see how the current data compares to the normal
range.

• Drilldown to an AAG report to diagnose the current values of a number of key
variables for this WAN link.

• If the Utilization In or Out is high, the WAN link may be causing delay. Refer to
the discussion in Table 10 for recommendations.

• If the number of frames is unusually high In and Out, and the Average Frame Size
is small, the WAN link may be carrying an unusually high number of control
frames. This may indicate a protocol problem. The Average Frame Size is a trend
variable.

• A new application or a new group of users may now be using this link. In these
cases, the alarm should remain active for a long time.
6 ATM Profiles

ATM services can be used to provide a WAN link between routers or switches in different sites. ATM can also be
used within a campus to connect LANs together. When ATM is used for WAN links, an Enterprise that uses the
service buys the service (the link) from an ATM service provider. ATM profiles are provided for both the
Enterprises that use the ATM service and the ATM Service Provider. The Service Provider profile could be used at
the provider's edge ATM switch where the traffic is initially received and policed, or inside the core of the network
to monitor ports for excessive causes of delay (over utilized links, traffic out of spec, too many discards, etc.).

When ATM is used within a campus, the Enterprise purchases and manages its own ATM switches.

6.1 ATM for the Enterprise Profiles

Three profiles are provided for Enterprise customers; they apply to routers or switches that access an ATM service
network. These profiles are also appropriate for LAN switches or routers connected to campus ATM switches.

• ATM for the Enterprise T1 - Delay, see Table 18, is appropriate for T1/E1 links, and the paths and channels
they carry.

• ATM for the Enterprise T3 - Delay, see Table 18, is appropriate for T3 or faster links, and the paths and
channels they carry.

• ATM for the Enterprise - Failure, see Table 19, is appropriate for all ATM ports, paths, and channels.
Table 18 ATM for the Enterprise T1 - Delay,
ATM for the Enterprise T3 - Delay

Element Type Rule, Trend Variable, Threshold Window Severity

ATM Port TOT, Bandwidth Utilization In > x% 15/60 min Minor

TOT, Bandwidth Utilization Out > x%

Message: Over Utilized In

Over Utilized Out

Description: The ATM Port is carrying too much traffic In or Out. As traffic builds on an outbound
link, when a frame arrives that is to be sent on that link, it will be queued until the link
becomes free. Since each frame must wait for the frames queued in front of it to be
serviced, longer queues add more delay to the latency of the packet.

The faster the link, the higher the utilization that can be supported. Profiles for two
speed ranges are supplied, T1 and T3.

The T1 profile supports T1/E1 ATM ports and the paths and circuits they carry, that is,
ports whose speed is 1.544 Mbps or 2.048 Mbps. Here x = 75%.

The T3 profile is for T3/E3 and higher speed ports like OC3, OC12, and beyond, and
the paths and circuits they carry. Here x = 90%.

Recommendations: • Get a faster port, for example, upgrade a T1 to a T3.

• Set up a parallel circuit, and split the traffic equally between the two circuits.

• Reroute traffic. If you have a mesh network with redundant paths, you may be
able to change the routing to direct some of the traffic to follow an alternate path.

• Add a direct circuit to divert traffic off this circuit. For example, if the Los
Angeles to Chicago circuit is too busy, and a large fraction of the traffic on the
circuit is destined for Atlanta, add a direct circuit from Los Angeles to Atlanta to
offload that traffic.

• Prioritize the traffic carried over the circuit, and use traffic shaping and policing to
ensure high priority traffic gets through with minimal delay, at the cost of
delaying the low priority traffic (or even discarding it).

Entire books have been written on network design and redesign. To dig deeper, start
with Scott Marcus' book "Designing Wide Area Networks and Internetworks" 3.

ATM Port TOT, Discarded Cells Out % > 0.5% 15/60 min Warning

Message: Too many discarded cells out

Description: When a queue grows, eventually the router, host, or switch will run out of buffers to
hold the queued cells, and any additional cells that should be sent out the interface will
be discarded.

Discards are normal in IP networks because the TCP protocol is designed to drive the
bottleneck link to saturation. The resulting congestion is then signaled back to the TCP
sender as discarded (lost) packets. Too many discards lower the overall network
efficiency, as the discarded packets must be resent. For ATM networks carrying IP
data, loss of a single cell means the whole frame is lost.

ATM networks carrying other kinds of data (Voice, Video, Switched SNA traffic) may
be more or less sensitive to discarded cells. For example, Voice is less sensitive to
discards, as an occasional lost cell can be tolerated.

Recommendations: While most discards are due to queueing discards, there are other reasons a router or
switch may discard cells. Depending on the device, see if any of these other reasons
may be causing discards:

• If the link is over utilized, deal with it as described above in the Over Utilized 
alarm. Note this may only move the bottleneck to another link. If the speed is 
increased, look to see if other links in the path are now seeing too many discards 
or are now over utilized. 

• Increase the number of buffers in the output queue. This is only appropriate if the 
link is not causing delay in the network, but is still discarding packets. If the link 
is causing significant delay, adding buffers can make it worse, without decreasing 
the discard rate significantly. 



ATM Port TOT, CLP1 Cells In % > 10% 15/60 min Warning

Message: Too many CLP1 cells in

Description: ATM networks can mark cells based on their priority. The cell's priority is in the CLP
(Cell Loss Priority) field, a single bit. If the CLP bit is 0, the cell has higher priority
than cells where the CLP bit is 1. Cells with CLP = 1 should be discarded in
preference to cells with CLP = 0. The ATM network can mark cells with CLP = 1
when they violate the traffic policies of the network, or when the ATM network is
congested.

Recommendations: Determine where the cells are being marked, and why:

• Is this ATM port's Bandwidth Utilization In too high? If it is, the ATM network
may be indicating congestion on its end of the ATM link.

• Is one or more of the ATM circuits carried by this port violating its traffic policy?
For example, an ATM circuit has been purchased with a CBR (Constant Bit Rate)
service, but the circuit is actually being used for data traffic and the sending
router or switch is treating the circuit as a UBR (Unspecified Bit Rate) circuit.
Then it is likely the traffic will violate the traffic policy, and the network may
mark the offending cells with CLP = 1. The receiving Port will see the CLP = 1
cells.

• Is there congestion inside the ATM network? The ATM circuits coming in over
this port likely share bandwidth with other circuits on links inside the ATM
network. If the aggregate traffic from those circuits sharing a link overuses the
link's bandwidth, you may see it as cells received with CLP = 1.

ATM Channel TOT, Bandwidth Utilization Out > 100% 15/60 min Warning

TOT, Bandwidth Utilization In > 100%

Message: Traffic in over SCR

Traffic out over SCR

Description: When an Enterprise customer of an ATM service buys a channel from a service
provider, they may purchase a particular capacity in terms of the Sustainable Cell Rate,
or SCR. This is the maximum bit rate that a user can offer to the service over a long
period and have all of the cells carried. It is generally less than or equal to Peak Cell
Rate, which is the maximum rate at which cells can be offered for a short period and
still be carried. eHealth attempts to determine the SCR when the channel is discovered,
and sets the speed to the SCR (as measured in bits/sec).

This alarm is raised when the traffic in or out of an ATM channel is above the SCR for
more than 15 minutes out of the past hour. When the traffic is above the SCR, the
ATM service may discard or delay cells.

Recommendations: • Check to see if the speed is correctly set to the SCR of the circuit. If the SCR is
higher (or lower), correct the speed in the poller configuration.

• Purchase an increased SCR from the ATM service provider.

• Reroute traffic. If you have a mesh network with redundant paths, you may be
able to change the routing to direct some of the traffic to follow an alternate path.

• Add a direct circuit to divert traffic off this circuit. For example, if the Los
Angeles to Chicago circuit is too busy, and a large fraction of the traffic on the
circuit is destined for Atlanta, add a direct circuit from Los Angeles to Atlanta to
offload that traffic.

• Prioritize the traffic carried over the circuit, and use traffic shaping and policing to
ensure high priority traffic gets through with minimal delay, at the cost of
delaying the low priority traffic (or even discarding it).

Entire books have been written on network design and redesign. To dig deeper, start
with Designing Wide Area Networks and Internetworks: A Practical Perspective.

ATM Channel TOT, AAL5 PDUs Discarded % > 1% 15/60 min Warning

Message: Too many AAL5 frames discarded

Description: ATM Channels used to carry IP traffic, or to carry frames between LAN switches in a
campus, often carry those frames using AAL5. Since frames are larger than cells,
AAL5 fragments a frame into multiple cells. The loss of any of those cells causes the
entire frame to be discarded. It may be discarded at the ATM switch or ATM access
device (the router or switch that connects to the ATM service).

If a cell is lost due to an error, the entire AAL5 frame will be discarded if the ATM
switch implements Partial Packet Discard (PPD). If a cell is lost due to congestion, or
traffic policing actions causing cells to be discarded, the switch will discard all the
following cells in that frame if the switch implements Early Packet Discard (EPD).

Recommendations: • Determine why cells are being discarded, and correct that problem. Note that a
few cells discarded (1 in 1000) could easily cause 2% of the frames to be lost if a
frame is fragmented into 20 cells.

• While most discards are due to queueing discards, there are other reasons a switch
may discard cells. Depending on the device, see if any of these other reasons may
be causing discards.

• If the link is over utilized, deal with it as described in the discussion of the Over
Utilized alarm above. Note this may only move the bottleneck to another link.
After increasing the speed, look to see if other links in the path are now seeing too
many discards or are now over utilized.

• Increase the number of buffers in the output queue. This is only appropriate if the
link is not causing delay in the network, but is still discarding packets. If the link
is causing significant delay, adding buffers can make it worse, without decreasing
the discard rate significantly.
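
The relationship between cell loss and AAL5 frame loss noted above (1 cell in 1000 lost, 20 cells per frame, roughly 2% of frames lost) can be checked with a short calculation. The following is a minimal sketch, not part of eHealth: it assumes cell losses are independent of each other, and the function name frame_loss_probability is hypothetical.

```python
def frame_loss_probability(cell_loss: float, cells_per_frame: int) -> float:
    """Probability that a frame loses at least one of its cells.

    Assumption (not from the product): each cell is lost independently
    with probability cell_loss. A frame survives only if every one of
    its cells survives.
    """
    if not 0.0 <= cell_loss <= 1.0:
        raise ValueError("cell_loss must be between 0 and 1")
    if cells_per_frame < 1:
        raise ValueError("cells_per_frame must be at least 1")
    return 1.0 - (1.0 - cell_loss) ** cells_per_frame


if __name__ == "__main__":
    # The example from the text: 1 lost cell in 1000, 20 cells per frame.
    loss = frame_loss_probability(0.001, 20)
    print(f"Frame loss: {loss:.2%}")
```

Because the frame loss grows almost linearly with the number of cells per frame at low loss rates, larger frames cross the 1% threshold of this rule at proportionally lower cell loss rates.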
Table 19 ATM for the Enterprise - Failure

Element Type Rule, Trend Variable, Threshold Window Severity

ATM Port Availability 30 min Critical

Message: ATM Port Down

Description: The ATM Port is down.

Recommendations: • Check to see if the problem is with this end of the link, or the other end.

ATM Port TOT, Errored Seconds > 5 sec 15/60 min Minor

Message: Too many seconds with errors

Description: The ATM port measures the number of seconds that have had errors in them. This
alarm is raised if more than 5 seconds out of a poll period (typically 5 minutes) have
errors, and the poll periods total more than 15 minutes out of the past hour.

Recommendations: • Determine the kinds of errors included in the computation of errored seconds
supported by the switch. Correct those problems.

ATM Port TOT, Severely Errored Seconds > 0 sec 15/60 min Major

Message: Too many seconds with severe errors

Description: The ATM port measures the number of seconds that have been severely errored. There
is a standard definition of severely errored seconds for SONET/SDH and DS1/DS3
physical links.

Recommendations: • Any severely errored seconds are a serious problem, and can lead to lost
connections and link down.

ATM Port TOT, Unavailable Seconds > 0 sec 5/60 min Critical

Message: Too many unavailable seconds

Description: An unavailable second is a second where the link is unusable.

Recommendations: • An unavailable second without the link going down indicates an intermittent
problem on the link that should be corrected.

ATM Path Availability 30 min Critical

Message: ATM Path down

Description: The ATM path is down. Since paths can carry multiple channels, this could indicate
the failure of a number of channels.

Recommendations: • Determine if the ATM port carrying the path is down.

• See if the far end of the circuit is down.

• Check to see if there is a problem within the ATM network.

ATM Channel Availability 30 min Critical

Message: ATM Channel Down

Description: An ATM channel is down.

Recommendations: • Check to see if the underlying port or path is down.

• See if the far end of the circuit is down.

• Check to see if there is a problem within the ATM network.
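
The Window column in these tables (for example 15/60 min) describes a time-over-threshold condition: the rule fires when the polled value has exceeded the threshold for more than 15 minutes out of the past 60. The following is a minimal sketch of that evaluation for the Errored Seconds rule. It is an illustration only, not eHealth's implementation; the function name, the fixed 5-minute poll period, and the list-of-samples representation are all assumptions.

```python
from typing import Sequence


def time_over_threshold_alarm(
    samples: Sequence[float],
    threshold: float,
    min_minutes: int = 15,
    window_minutes: int = 60,
    poll_minutes: int = 5,
) -> bool:
    """Return True if the value exceeded the threshold for more than
    min_minutes within the most recent window_minutes.

    samples holds one value per poll period, oldest first (hypothetical
    representation; eHealth's internal storage is not described here).
    """
    if poll_minutes <= 0 or window_minutes < poll_minutes:
        raise ValueError("poll period must be positive and fit in the window")
    polls_in_window = window_minutes // poll_minutes
    recent = samples[-polls_in_window:]
    # Each poll period over the threshold contributes its full length.
    minutes_over = sum(poll_minutes for value in recent if value > threshold)
    return minutes_over > min_minutes


if __name__ == "__main__":
    # Errored seconds per 5-minute poll over the past hour (made-up data).
    errored_seconds = [0, 0, 6, 7, 0, 0, 8, 0, 0, 9, 0, 0]
    print(time_over_threshold_alarm(errored_seconds, threshold=5))
```

With 5-minute polls, three polls over the threshold total exactly 15 minutes and do not raise the alarm under this reading of "more than 15 minutes"; a fourth poll over the threshold does.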



6.2 ATM Service Provider Profiles 

Three profiles are provided for ATM service providers; they apply to the ATM ports, paths and channels on the 
ATM switches in the network. 

• ATM for the Service Provider T1 - Delay, Table 20, is appropriate for T1/E1 links, and the paths and channels
they carry.

• ATM for the Service Provider T3 - Delay, Table 20, is appropriate for T3 or faster links, and the paths and 
channels they carry. 

• ATM for the Service Provider - Failure, Table 21, is appropriate for all ATM ports, paths, and channels. 
Table 20 Delay - ATM for the Service Provider T1
Delay - ATM for the Service Provider T3

Element Type Rule, Trend Variable, Threshold Window Severity

ATM Port TOT, Bandwidth Utilization Out > x% 15/60 min Minor

TOT, Bandwidth Utilization In > x%

Message: Over utilized out

Over utilized in

Description:

Recommendations:

ATM Port TOT, Discarded Cells Out % > 0.5% 15/60 min Minor

TOT, Discarded Cells In % > 0.5%

Message: Too many discarded cells out

Too many discarded cells in

Description:

Recommendations:

ATM Port TOT, CLP1 Cells Out (%) > 10% 15/60 min Minor

TOT, CLP1 Cells In (%) > 10%

Message: Too many CLP1 frames out

Too many CLP1 frames in

Description:

Recommendations:

ATM Port TOT, CLP0 Discards Out % > 0.1% 15/60 min Minor

Message: Too many CLP0 frames discarded

Description:

Recommendations:

ATM Port TOT, Policy Violations In % > 10% 15/60 min Minor

TOT, Policy Violations Out % > 10%

Message: Too many policy violations in

Too many policy violations out

Description:

Recommendations:

ATM Path TOT, Bandwidth Utilization Out > 100% 15/60 min Minor

TOT, Bandwidth Utilization In > 100%

Message: Over utilized out

Over utilized in

Description:

Recommendations:

ATM Path TOT, CLP1 Cells % > 10% 15/60 min Minor

Message: Too many CLP1 frames

Description:

Recommendations:

ATM Path TOT, Discarded Cells In % > 0.5% 15/60 min Minor

Message: Too many discarded cells out 5

Description:

Recommendations:

ATM Path TOT, CLP0 Discards % > 0.1% 15/60 min Minor

Message: Too many CLP0 frames discarded

Description:

Recommendations:

5 Probable bug: the message says "out" but the variable is "in".
ATM Channel TOT, Bandwidth Utilization Out > 100% 15/60 min Minor

TOT, Bandwidth Utilization In > 100%

Message: Traffic out over SCR

Traffic in over SCR

Description:

Recommendations:

ATM Channel TOT, Discarded Cells Out % > 0.5% 15/60 min Minor

TOT, Discarded Cells In % > 0.5%

Message: Too many discarded cells out

Too many discarded cells in

Description:

Recommendations:

ATM Channel TOT, CLP1 Cells In (%) > 10% 15/60 min Minor

Message: Too many CLP1 frames in

Description:

Recommendations:

ATM Channel TOT, CLP0 Discards Out % > 0.1% 15/60 min Minor

Message: Too many CLP0 frames discarded

Description:

Recommendations:
Table 21 ATM for the Service Provider - Failure

Element Type Rule, Trend Variable, Threshold Window Severity

ATM Port Availability 30 min Critical

Message: ATM Port Down

Description: The ATM Port is down.

Recommendations: • Check to see if the problem is with this end of the link, or the other end.

ATM Port TOT, Errored Seconds > 5 sec 15/60 min Minor

Message: Too many seconds with errors

Description: The ATM port measures the number of seconds that have had errors in them. This
alarm is raised if more than 5 seconds out of a poll period (typically 5 minutes) have
errors, and the poll periods total more than 15 minutes out of the past hour. This is a
standard measure of errors on ATM links.

Recommendations: • Determine the kinds of errors included in the computation of errored seconds
supported by the switch. Correct those problems.

ATM Port TOT, Severely Errored Seconds > 0 sec 15/60 min Major

Message: Too many seconds with severe errors

Description: The ATM port measures the number of seconds that have been severely errored. There
is a standard definition of severely errored seconds for SONET/SDH 6 links.

Recommendations: • Any severely errored seconds are a serious problem, and can lead to lost
connections and link down.

ATM Port TOT, Unavailable Seconds > 0 sec 5/60 min Critical

Message: Too many unavailable seconds

Description: An unavailable second is a second where the link is unusable.

Recommendations: • An unavailable second without the link going down indicates an intermittent
problem on the link that should be corrected.

ATM Path Availability 30 min Critical

Message: ATM Path down

Description: The ATM path is down. Since paths can carry multiple channels, this could indicate
the failure of a number of channels.

Recommendations: • Determine if the ATM port carrying the path is down.

ATM Channel Availability 30 min Critical

Message: ATM Channel Down

Description: An ATM channel is down.

Recommendations: • Check to see if the underlying port or path is down.

• See if the far end of the circuit is down.

• Check to see if there is a problem within the ATM network.



6.3 ATM - Unusual Workload Profiles 

The ATM - Unusual Workload profile is appropriate for all kinds of ATM ports, paths, and channels. 



6 Anyone know if there's a standard definition for severely errored seconds? 
Table 22 ATM - Unusual Workload

Element Type Rule, Trend Variable, Threshold Window Severity

ATM Port • (DFM, Cells In above 99.9 percentile) AND 15/60 min Warning
(TOT, Bandwidth In > 25%)

• (DFM, Cells Out above 99.9 percentile) AND
(TOT, Bandwidth In > 25%)

Message: • Unusually high cells in

• Unusually high cells out

Description: The number of cells in or out are unusually high. The rule alarms only if the bandwidth
utilization on the port is over 25%. Refer to section 12.1.

Recommendations: • Drilldown to a Trend report to see how the current data compares to the normal
range.

• Drilldown to an AAG report to diagnose the current values of a number of key
variables for this ATM Port.

• If the Utilization In or Out is high, the Port may be causing delay. Refer to the
discussion in Table 18 for the Over Utilized alarms for recommendations.

• A new application or a new group of users may now be using this link. In these
cases, the alarm should remain active for a long time.


ATM Path • (DFM, Cells In above 99.9 percentile) AND 15/60 min Warning
(TOT, Bandwidth In > 25%)

• (DFM, Cells Out above 99.9 percentile) AND
(TOT, Bandwidth In > 25%)

Message: • Unusually high cells in

• Unusually high cells out

Description: The number of cells in or out are unusually high. The rule alarms only if the bandwidth
utilization on the path is over 25%. Refer to section 12.1.

Recommendations: • Drilldown to a Trend report to see how the current data compares to the normal
range.

• Drilldown to an AAG report to diagnose the current values of a number of key
variables for this ATM Path.

• If the Utilization In or Out is high, the Path may be causing delay. Refer to the
discussion in Error! Reference source not found, for recommendations.

• A new application or a new group of users may now be using this path. In these
cases, the alarm should remain active for a long time.


ATM Channel • (DFM, Cells In above 99.9 percentile) AND 15/60 min Warning
(TOT, Bandwidth In > 25%)

• (DFM, Cells Out above 99.9 percentile) AND
(TOT, Bandwidth In > 25%)

Message: • Unusually high cells in

• Unusually high cells out

Description: The number of cells in or out are unusually high. The rule alarms only if the bandwidth
utilization on the channel is over 25%. Refer to section 12.1.

Recommendations: • Drilldown to a Trend report to see how the current data compares to the normal
range.

• Drilldown to an AAG report to diagnose the current values of a number of key
variables for this ATM Channel.

• If the Utilization In or Out is high, the Channel may be causing delay. Refer to the
discussion in Error! Reference source not found, for recommendations.

• A new application or a new group of users may now be using this channel. In
these cases, the alarm should remain active for a long time.
ATM Channel • (DFM, AAL5 PDUs In above 99.9 percentile) AND 15/60 min Warning
(TOT, Bandwidth In > 25%)

• (DFM, AAL5 PDUs Out above 99.9 percentile) AND
(TOT, Bandwidth In > 25%)

Message: • Unusually high AAL5 PDUs in

• Unusually high AAL5 PDUs out

Description: The number of AAL5 PDUs in or out are unusually high. The rule alarms only if the
bandwidth utilization on the channel is over 25%. Refer to section 12.1.

Recommendations: • Drilldown to a Trend report to see how the current data compares to the normal
range.

• Drilldown to an AAG report to diagnose the current values of a number of key
variables for this ATM Channel.

• If the Utilization In or Out is high, the Channel may be causing delay. Refer to the
discussion in Error! Reference source not found, for recommendations.

• A new application or a new group of users may now be using this channel. In
these cases, the alarm should remain active for a long time.
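
The Unusual Workload rules in Table 22 combine two conditions: the value is above the 99.9 percentile of its normal range, and the bandwidth utilization is above 25%. The following sketch shows the shape of that compound test for a single sample. It is an illustration under stated assumptions: this section does not describe how eHealth computes its baseline, so the nearest-rank percentile over a plain list of historical samples is a stand-in, and all function and parameter names are hypothetical.

```python
import math
from typing import Sequence


def nearest_rank_percentile(history: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of history (a stand-in baseline; the
    product's actual baseline computation is not described here)."""
    if not history:
        raise ValueError("history must not be empty")
    if not 0.0 < pct <= 100.0:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(history)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def unusual_workload(
    current: float,
    history: Sequence[float],
    bandwidth_util_pct: float,
    pct: float = 99.9,
    min_util_pct: float = 25.0,
) -> bool:
    """True when the current value is above the baseline percentile AND
    the bandwidth utilization is above the minimum, as in the table rows."""
    above_normal = current > nearest_rank_percentile(history, pct)
    busy_enough = bandwidth_util_pct > min_util_pct
    return above_normal and busy_enough


if __name__ == "__main__":
    cells_history = list(range(1, 1001))  # made-up historical cell rates
    print(unusual_workload(2000, cells_history, bandwidth_util_pct=30.0))
    print(unusual_workload(2000, cells_history, bandwidth_util_pct=10.0))
```

The utilization condition keeps the rule quiet on lightly loaded elements, where a large relative jump in cells or PDUs is still a small absolute load. In the tables this per-sample test is further subject to the 15/60 min window.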



7 Router and Switch Profiles 

Three profiles are provided for routers and switches. All three apply to any kind of router or switch: 

• Router or Switch - Delay, see Table 23 

• Router or Switch - Failure, see Table 24

• Router or Switch - Unusual Workload, see Table 25 
7.1 Router or Switch - Delay Profile

Table 23 Router or Switch - Delay

Element Type Rule, Trend Variable, Threshold Window Severity

Router, TOT, Average Line Utilization > 70% 15/60 min Major

Router with CPU

Message: Interfaces too busy

Description: The interfaces on this router are, in aggregate, too busy. Average Line Utilization is
measured by summing the Bandwidth Utilization of each polled interface on the
router, and dividing by the number of polled interfaces.

If the router or switch has interfaces of widely different speeds, this alarm won't detect
problems where the slow speed links are too busy. For example, with an Ethernet LAN
interface at a utilization of 2% and a 56K WAN link at a utilization of 90%, the
router will have an average interface utilization of only 46%. To detect problems with
the low speed links, the interfaces should be discovered as LAN/WAN interfaces, the
interfaces should be placed in the appropriate groups, and the groups should be
monitored with the appropriate profiles.



Recommendations: • The interfaces on this router are seriously over used. Many of the interfaces 

should be upgraded in speed, or the traffic on the interfaces should be reduced. 

• If you are not monitoring the individual interfaces, you should do so now. 

Router, TOT, Average Packet Discards > 5% 15/60 min Major 

Router with CPU 

Message: Too many discards 

Description: The router is discarding too many packets. The average packet discards is the sum of 

the discards % for the polled interfaces, divided by the number of polled interfaces. 
Recommendations: • Drilldown to an AAG report for the router or switch to diagnose the router and see 

if there are related problems. 

• While most discards are due to queueing discards, there are other reasons a router 
may discard packets. Depending on the device, see if any of these other reasons 
may be causing discards. 

• If the interfaces are over utilized, deal with them as described above. 

• Increase the number of buffers in the output queue. This is only appropriate if the 
router is not causing delay in the network, but is still discarding packets. If the 
router is causing significant delay, adding buffers can make it worse, without 
decreasing the discard rate significantly. 

• Implement RED (Random Early Discards) on the router. RED is a technique 
supported by many routers and switches to signal congestion to TCP flows before 
the queue fills. This has proven extremely effective in lowering discards, and 
improving overall network performance. However, if most of the traffic is based 
on UDP, or protocols other than TCP/IP protocols, RED may not affect them. 



Switch Plus Backplane TOT, Backplane Utilization > 50% 15/60 min Major

Message: Backplane over utilized 

Description: The switch backplane utilization is too high. When the backplane is too busy, the 

switch will delay a packet as it is transferred from the receiving interface to the 
interface the frame should be forwarded out. The backplane is the central bottleneck 
through which all packets must pass. 

Recommendations: • Lower the traffic through the switch, either by rerouting traffic around the switch, 

or by cutting the number of users it supports. 
• Upgrade the switch to one with a faster backplane. 

Router CPU, TOT, CPU Utilization > 60% 15/60 min Major

Switch CPU 

Message: CPU too busy 

Description: The CPU utilization is too high. When the CPU is too busy, frames may be delayed as 

the CPU cannot quickly decide how to forward this frame. Other functions performed 
by the CPU, such as processing routing updates may be delayed as well. 
Recommendations: • Review the functions being performed in the router. Is it performing extra 

processing that may not be needed? For example, is the router performing extra 
filtering on each packet? 
• Some routers allow the CPU to be replaced with a faster processor. Other routers 
allow additional processors to be added. 
• Upgrade the router to a newer, faster, router. 
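
The Average Line Utilization rule in Table 23 averages the utilization of all polled interfaces, which is why one saturated slow link can hide behind an idle fast one. The sketch below reproduces the 2% and 90% example from that rule's description and then checks each interface individually. The function name and the interface labels are hypothetical, and the per-interface check reuses the 70% rule threshold only for illustration; the interface profiles define their own thresholds.

```python
from typing import Mapping


def average_line_utilization(utilization_pct: Mapping[str, float]) -> float:
    """Sum the utilization of each polled interface and divide by the
    number of polled interfaces, as described for the router rule."""
    if not utilization_pct:
        raise ValueError("at least one polled interface is required")
    return sum(utilization_pct.values()) / len(utilization_pct)


if __name__ == "__main__":
    # Example from the text: an Ethernet LAN interface and a 56K WAN link.
    interfaces = {"ethernet-lan": 2.0, "wan-56k": 90.0}
    threshold = 70.0  # threshold of the Average Line Utilization rule

    average = average_line_utilization(interfaces)
    print(f"Average: {average:.0f}%, router alarm: {average > threshold}")

    # Checking interfaces one by one exposes the busy WAN link.
    busy = [name for name, pct in interfaces.items() if pct > threshold]
    print(f"Interfaces over {threshold:.0f}%: {busy}")
```

This is the reason the text recommends discovering the interfaces as LAN/WAN elements and monitoring them with their own profiles rather than relying on the router-level average.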
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7.2 Router or Switch - Failure Profile 



Table 24 Router or Switch - Failure 



Element Type 


Rule, Trend Variable, Threshold Window Severity 


Rout er, 


Availability 30 min Critical 


Router with CPU 




Message: 


Router Down 


Description: 


The router went down. Note that eHealth cannot be sure the router was actually down 


until it has come back up. Refer to the general discussion in 12.4 Availability and 




Reachability Alarms for Hosts. 


Recommendations: 


• 


Router, 


Reachability 30 min Critical 


Router with CPU 




Message: 


Router unreachable 


Description: 


The router is unreachable, and may be down Network problems may also prevent the 




poller from reaching the router's IP Address. 


Recommendations: 


• Check to see if the device can be reached (for example, by sending a number of 




pings to the device). 




• Check if other routers between the eHealth console and the router are down. 




• If the router is up, and the reachability alarm persists, check the latency to the 




router If* it i«; often hi^h eHealth mav he seeini? timeouts on the ninf? 




• The nrohlem rmild he that the latencv to the router is too hiph Fix the 




network latency problem. 




• The problem could be that the eHealth ping timeout is set to low. Increase the 




ping timeout used by eHealth. 


Element Type: Router, Router with CPU
Rule, Trend Variable, Threshold: TOT, Errors In % > 2%
Window: 15/60 min
Severity: Major
Message: Too many errors in
Description: The router or switch has too many errors when it is receiving frames.
Recommendations:
• Identify which interfaces are having the errors. If all of the LAN/WAN interfaces
to the router or switch are being monitored by eHealth using the appropriate
Failure profiles, then a similar alarm should have been raised on the failing
interface.
• Errors In often include frames that are corrupted in transmission.
• Errors In often include errors encountered within the receiving interface
hardware/software.


Element Type: Router, Router with CPU
Rule, Trend Variable, Threshold: TOT, Errors Out % > 2%
Window: 15/60 min
Severity: Major
Message: Too many errors out
Description: The router or switch has too many errors when it is sending frames.
Recommendations:
• Identify which interfaces are having the errors. If all of the LAN/WAN interfaces
to the router or switch are being monitored by eHealth using the appropriate
Failure profiles, then a similar alarm should have been raised on the failing
interface.
• Errors Out often include errors encountered within the sending interface
hardware/software.


Element Type: Router CPU, Switch CPU
Rule, Trend Variable, Threshold: TOT, Free Memory < 2000000 bytes
Window: 15/60 min
Severity: Major
Message: Free memory too low
Description: The amount of free memory is too low.
Recommendations:
• Add more memory.
• Free reserved memory.


Element Type: Router CPU
Rule, Trend Variable, Threshold: PFM, Total Buffers outside 10% from mean
Window: 15/60 min
Severity: Warning
Message: Configuration change - Total Buffers
Description: The total number of buffers used to hold frames within the router or switch is often
controlled by configuration choices. Thus a change in the total buffers signals a change
in the configuration. This may not be a problem.
Recommendations:
• This alarm is for information only. It may point to a configuration change that
caused memory-related problems in the router or switch.


Element Type: Router CPU
Rule, Trend Variable, Threshold: TOT, Buffer Misses > 0.01 misses/sec
Window: 15/60 min
Severity: Warning
Message: Misconfigured buffers - Router buffer misses
Description: On Cisco routers and switches, memory buffers come in different sizes to hold
different size frames when they are received. For example, when a small frame is
received and all the small buffers are busy, the router counts this as a small buffer
miss and uses a larger buffer to hold the frame.
Recommendations:
• Buffer misses do not cause frames to be discarded unless all the buffers are full.
Check for Discards In.
• Buffer misses indicate a small decrease in efficiency of memory usage, and
slightly more processing of the frames forwarded.
• Buffer misses indicate that not enough memory has been allocated to that size
buffer pool. Increase the number of buffers allocated to the pool.
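
The fallback behavior described for buffer misses can be pictured with a short Python sketch. It is only an illustration under stated assumptions: the `BufferPool` class, the pool names, and the buffer sizes are hypothetical and are not taken from any Cisco or eHealth interface.

```python
from dataclasses import dataclass


@dataclass
class BufferPool:
    name: str          # hypothetical pool label, e.g. "small"
    buffer_size: int   # largest frame (in bytes) one buffer of this pool can hold
    free: int          # number of buffers currently free
    misses: int = 0    # times a frame wanted this pool but found it busy


def allocate(pools, frame_len):
    """Place a frame in the smallest pool that fits and has a free buffer.

    Returns the name of the pool used, or None when every suitable pool is
    full (the only case in which the frame would be discarded).
    """
    for pool in sorted(pools, key=lambda p: p.buffer_size):
        if pool.buffer_size < frame_len:
            continue              # frame does not fit in this pool
        if pool.free > 0:
            pool.free -= 1
            return pool.name
        pool.misses += 1          # pool was busy: count a miss, try a larger one
    return None


pools = [BufferPool("small", 104, free=0), BufferPool("big", 1524, free=1)]
used = allocate(pools, 64)        # small pool is busy, so the big pool is used
```

In this sketch a miss only costs a larger buffer; a frame is lost only when `allocate` returns `None`, which matches the advice to check Discards In.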


Element Type: Router CPU, Switch CPU
Rule, Trend Variable, Threshold: TOT, Fan Status > 2.5
Window: 1/60 min
Severity: Major
Message: Fan Failed
Description: The fan in the switch has failed.
Recommendations:
• Fix or replace the fan.


Element Type: Router CPU, Switch CPU
Rule, Trend Variable, Threshold: TOT, Fan Status > 1.5
Window: 30/60 min
Severity: Minor
Message: Fan Marginal
Description: The fan in the switch is marginal, and is in danger of failing.
Recommendations:
• Fix or replace the fan.


Element Type: Router CPU, Switch CPU
Rule, Trend Variable, Threshold: TOT, Power Supply 1 Status > 2.5
Window: 1/60 min
Severity: Major
Message: Power Supply 1 Failed
Description: Power supply #1 in the router or switch has failed. If this is the only power supply, or
the other one has failed as well, the router will go down when it runs out of battery
power.
Recommendations:
• Fix or replace the power supply.


Element Type: Router CPU, Switch CPU
Rule, Trend Variable, Threshold: TOT, Power Supply 1 Status > 1.5
Window: 30/60 min
Severity: Minor
Message: Power Supply 1 Marginal
Description: Power supply #1 is marginal.
Recommendations:
• Fix or replace the power supply.


Element Type: Router CPU, Switch CPU
Rule, Trend Variable, Threshold: TOT, Power Supply 2 Status > 2.5
Window: 1/60 min
Severity: Major
Message: Power Supply 2 Failed
Description: Power supply #2 in the router or switch has failed. If this is the only power supply, or
the other one has failed as well, the router will go down when it runs out of battery
power.
Recommendations:
• Fix or replace the power supply.


Element Type: Router CPU, Switch CPU
Rule, Trend Variable, Threshold: TOT, Power Supply 2 Status > 1.5
Window: 30/60 min
Severity: Minor
Message: Power Supply 2 Marginal
Description: Power supply #2 is marginal.
Recommendations:
• Fix or replace the power supply.




Element Type: Router CPU, Switch CPU
Rule, Trend Variable, Threshold: TOT, Temperature Status > 2.5
Window: 1/60 min
Severity: Major
Message: Critical High Temperature
Description: The temperature is too high.
Recommendations:
• Check the air conditioning in the room where the router is located.
• Check the fan in the router.
• Lower the temperature.
• If all else fails, shut down the router or switch.


Element Type: Router CPU, Switch CPU
Rule, Trend Variable, Threshold: TOT, Temperature Status > 1.5
Window: 30/60 min
Severity: Minor
Message: Marginal Temperature
Description: The temperature is marginal, and may soon be too high.
Recommendations:
• Check the air conditioning in the room where the router is located.
• Check the fan in the router.
• Lower the temperature.




Element Type: Router CPU, Switch CPU
Rule, Trend Variable, Threshold: TOT, Topology Changes > 1.5
Window: 30/60 min
Severity: Major
Message: Bridge (spanning tree) Topology changing
Description: A change in the topology of the switched/bridged LAN has caused the spanning tree to
change. This bridge or switch has received a spanning tree change announcement from
another switch.

When the spanning tree changes, the switch/bridge may have to relearn where stations
are located. While this is occurring, the bridge will forward all frames on all interfaces,
thus increasing the network traffic. Under some conditions, spanning tree changes can
cause frames to be discarded at the switch.
Recommendations:
• Check other switches in the extended, switched LAN to see if any switch or
bridge has gone down.
• Check the discards to see if the switch discarded a large number of frames as a
result of the topology change.
• Some topology changes are caused by a switch temporarily losing
communications with a neighboring switch. Check to see if any interfaces on this
switch are discarding too many frames, or have errors.
7.3 Router or Switch - Unusual Workload Profile

Table 25 Router or Switch - Unusual Workload

Element Type: Router
Rule, Trend Variable, Threshold:
• (DFM, Frames In above 99.9%) AND (PFM, Frames In above 25%)
• (DFM, Frames In above 99.9%) AND (PFM, Frames In above 25%)
Window: 15/60 min
Severity: Warning
Message:
• Unusually high frames in
• Unusually high frames out
Description: The number of frames in or out is unusually high. The alarm is raised only if the
frames are 25% above the mean. Refer to section 12.1.
Recommendations:
• Drill down to a Trend report to compare the current usage with the baseline.
• Drill down to an AAG report for this router to see if any related changes have
occurred.
• For a router, where the number of frames out should be roughly the number of
frames in, these alarms will normally be raised together.
• For a switch, where the switch receives many frames which are not forwarded, the
two alarms are independent.
• Any router or switch has a limit on the number of frames it can forward.


Element Type: Router CPU
Rule, Trend Variable, Threshold: (DFM, CPU Utilization above 99%) AND (PFM, CPU Utilization above 25%)
Window: 15/60 min
Severity: Warning
Message: Unusually high CPU utilization
Description: The CPU utilization is unusually high. The alarm is raised only if the
CPU utilization is 25% above the mean. Refer to section 12.1.
Recommendations:
• Drill down to a Trend report to compare the current usage with the baseline.
• Run an AAG report for this router to see if any related changes have occurred.
• If the CPU utilization is too high, refer to the discussion of the CPU Utilization
too high alarm for recommendations.


Element Type: Router CPU
Rule, Trend Variable, Threshold: (DFM, Buffers Used above 99%) AND (PFM, Buffers Used above 25%)
Window: 15/60 min
Severity: Warning
Message: Unusually high buffers used
Description: The number of buffers used is unusually high. The alarm is raised only if the number
of buffers used is 25% above the mean. Refer to section 12.1.
Recommendations:
• Drill down to a Trend report to compare the current usage with the baseline.
• Run an AAG report for this router to see if any related changes have occurred.
• If too many buffers are used:
  • If any interfaces have a high Bandwidth Utilization Out, increase the speed of
  the interface to lower the number of buffers needed to hold frames forwarded
  out the interface.
  • Increase the memory allocated to buffers. This may require increasing the
  memory in the router or switch.


Element Type: Router CPU
Rule, Trend Variable, Threshold: (DFM, Free Memory below 99%) AND (PFM, Free Memory below 25%)
Window: 15/60 min
Severity: Warning
Message: Unusually low free memory
Description: The amount of free memory is unusually low. The alarm is raised only if the free
memory is 25% below the mean. Refer to section 12.1.
Recommendations:
• Drill down to a Trend report to compare the current usage with the baseline.
• Run an AAG report for this router to see if any related changes have occurred.
• If the free memory is too low, refer to the discussion of the Free memory too low
alarm for recommendations.


Element Type: Switch Plus Backplane
Rule, Trend Variable, Threshold: (DFM, Backplane Utilization above 99%) AND (PFM, Backplane Utilization above 25%)
Window: 15/60 min
Severity: Warning
Message: Unusually high backplane utilization
Description: The backplane utilization is unusually high. The alarm is raised only if
the backplane utilization is 25% above the mean. Refer to section 12.1.
Recommendations:
• Drill down to a Trend report to compare the current usage with the baseline.
• Run an AAG report for this switch to see if any related changes have occurred.
• If the backplane utilization is too high, refer to the discussion of the backplane
utilization too high alarm for recommendations.
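
The two-clause rules in Table 25 can be read as a percentile test combined with a percent-from-mean test. The Python sketch below is one possible reading, not the eHealth implementation: it assumes the DFM clause compares the current value with a percentile of the baseline samples and the PFM clause compares it with the baseline mean. The function names and the nearest-rank percentile method are hypothetical choices made for the example.

```python
import math
from statistics import mean


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile of a non-empty sample (assumed method)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def unusually_high(current, baseline, dfm_pct=99.9, pfm_pct=25.0):
    """True only when both clauses of an 'unusually high' rule hold."""
    above_percentile = current > nearest_rank_percentile(baseline, dfm_pct)
    above_mean = current > mean(baseline) * (1 + pfm_pct / 100)
    return above_percentile and above_mean


# A very steady baseline: 99 samples at 100 frames/sec and one at 104.
baseline = [100] * 99 + [104]
small_bump = unusually_high(110, baseline)   # above the percentile, but not 25% above the mean
real_surge = unusually_high(130, baseline)   # satisfies both clauses
```

The second clause is what keeps a very stable element from raising alarms on small absolute changes: a value can exceed the baseline percentile and still be ignored if it is less than 25% above the mean.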
8 Server Profiles

Three profiles are provided for servers. These profiles apply to any server. 

• Server - Delay, see Table 26. 

• Server - Failure, see Table 27. 

• Server - Unusual Workload, see Table 28. 

8.1 Server - Delay Profile 

Server performance can be delayed by any of the following five components:

1. The CPU and its speed in executing instructions.

2. The disk I/O subsystem, which reads and writes data to disks.

3. The memory subsystem, including physical and virtual memory.

4. Partition (or File System) capacity.

5. Network I/O bandwidth.

Alarm rules for each of these components are included in the Server - Delay profile. 
Table 26 Server - Delay

Element Type: Generic Server, Managewise Server, Insight Manager Server, BMC NT Server,
BMC Unix Server, Empire Unix Server, Empire NT Server
Rule, Trend Variable, Threshold: TOT, CPU Imbalance > 10
Window: 15/60 min
Severity: Minor
Message: CPU Imbalance
Description: In a server with more than one CPU, CPU Imbalance measures how well the workload
is balanced between the processors. A value of zero indicates that the CPUs all have
the same CPU utilization, while a value of 100 means the CPUs have maximally
different CPU utilizations. If a two-processor system has one processor at 100%
utilization while the other is at 0% utilization, then that system has a CPU Imbalance
of 100. In such a case, the benefit of having a second CPU is lost.
Recommendations:
• Examine the CPU Utilization of all the processors. In particular, look at the time
spent in User versus System time. In some operating systems or certain hardware
configurations, one processor handles most or all of the hardware interrupts. If
such a system is spending a lot of time in System mode, the processor load may be
imbalanced. Changes to the hardware or operating system may be able to resolve
this problem.


Element Type: Generic Server, BMC NT Server, BMC Unix Server, Empire Unix Server,
Empire NT Server
Rule, Trend Variable, Threshold: TOT, Pages Paged In > 10 pages/sec
Window: 15/60 min
Severity: Minor
Message: Paging too high
Description: On NT systems, Pages Paged In measures the rate at which pages are paged in from
paging files on disk to physical memory due to a page fault by a process. Because paging
in a page takes so long, and because every page paged in causes a disk I/O, a high
Pages Paged In rate indicates that the system's virtual memory system is in trouble.
Recommendations:
• If the system is low on physical memory, add more memory.
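
The CPU Imbalance entry in Table 26 gives the end points of the scale (0 for identical utilization, 100 for one processor at 100% and another at 0%) but not the formula. The Python sketch below uses the spread between the busiest and the idlest processor, which reproduces both end points; this formula is an assumption made for illustration and may not be how the agent computes the variable.

```python
def cpu_imbalance(utilizations):
    """Spread between busiest and idlest CPU, in percentage points.

    Assumed formula: max minus min of the per-CPU utilization values.
    A single-CPU server cannot be imbalanced, so it reports 0.
    """
    if len(utilizations) < 2:
        return 0.0
    return float(max(utilizations) - min(utilizations))


def imbalance_alarm(utilizations, threshold=10.0):
    """Apply the 'CPU Imbalance > 10' threshold from the profile."""
    return cpu_imbalance(utilizations) > threshold


worst_case = cpu_imbalance([100, 0])      # the two-processor example from the table
balanced = cpu_imbalance([40, 40, 40])    # identical utilization on every CPU
```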



Element Type: Empire NT Server
Rule, Trend Variable, Threshold: TOT, Free Memory < 4000000
Window: 15/60 min
Severity: Minor
Message: Available memory too low
Description: Windows NT attempts to keep 4 Mbytes of available memory at all times. Available
memory is free physical memory, i.e., memory not dedicated to the operating system
or any process.
Recommendations:
• Add physical memory.
• Examine applications to see if they can make more efficient use of memory or if
they can localize their memory use.



Element Type: Empire Unix Server
Rule, Trend Variable, Threshold: TOT, Load Average > 2
Window: 15/60 min
Severity: Minor
Message: Load average too high
Description: The 5-minute load average on Unix systems is a measure of the length of the process
run queue. It measures how many processes are running, or would like to run. When
the load average is high, processes must wait to get their turn to use a processor.
Recommendations:
• Add an additional processor.
• Move some of the users to a different machine to split the load.
• If the user time is high, examine the applications to see if they can be optimized.
Element Type: Empire Unix Server
Rule, Trend Variable, Threshold: TOT, Page Scans > 200 pages scanned/sec
Window: 15/60 min
Severity: Minor
Message: Page scans too high
Description: On Unix systems, in particular on Solaris, when the operating system is running short
of free physical memory it "scans" pages to see if they are candidates to be paged out
to swap space. The rate at which the operating system scans pages measures how
frantic the operating system is to free memory. On Unix, any form of I/O causes a
page fault, and so the page fault rate can be a misleading indicator of memory
problems. A system performing well and doing a lot of I/O can have a high page fault
rate.
Recommendations:
• Add physical memory.
• Examine applications to see if they can make more efficient use of memory or if
they can localize their memory use.



Element Type: Generic Server, Managewise Server, Insight Manager Server, BMC NT Server,
BMC Unix Server, Empire Unix Server, Empire NT Server
Rule, Trend Variable, Threshold: TOT, Average CPU Utilization > 90%
Window: 15/60 min
Severity: Minor
Message: CPU too busy
Description: The CPUs on the server are too busy as measured by their average CPU Utilization. If
the processors are too busy, the CPU run queue length often grows and user requests
are delayed. Refer to section 12.5 for a discussion.
Recommendations:
• Add an additional processor.
• Move some of the users to a different machine to split the load.
• If the user time is high, examine the applications to see if they can be optimized.


Element Type: Server Disk
Rule, Trend Variable, Threshold: TOT, Disk I/O Utilization > 50%
Window: 15/60 min
Severity: Minor
Message: Disk too busy
Description: The Disk I/O Utilization measures the percentage of time a disk is busy transferring
data to or from the disk. When the disk is too busy, the disk queue grows and transfers
must wait their turn to use the disk. Disk I/Os can result from application or system
activity, or from paging.
Recommendations:
• If the system is paging, try to fix that problem first.
• Split the workload (as measured by disk reads and writes) equally across multiple
disks. For example, if the transfers are related to paging, set up swap or paging
files on another disk. Striping a file system across multiple disks can also spread
I/Os across multiple disks.
• Examine the applications to see if they can use the disk more efficiently.
• Add another disk. Consider adding a separate disk controller for the disk.



Element Type: Server Disk
Rule, Trend Variable, Threshold: TOT, Disk Queue Length > 2
Window: 15/60 min
Severity: Minor
Message: Disk queue too long
Description: Disk Queue Length measures the length of the queue of I/Os waiting for or using the
disk. As the disk queue grows, the time an I/O must wait for other I/Os to complete grows
as well. This slows all disk operations, and slows the response time of any application
that performs I/O.
Recommendations:
• If the system is paging, try to fix that problem first.
• Split the workload (as measured by disk reads and writes) equally across multiple
disks. For example, if the transfers are related to paging, set up swap or paging
files on another disk. Striping a file system across multiple disks can also spread
I/Os across multiple disks.
• Examine the applications to see if they can use the disk more efficiently.
• Add another disk. Consider adding a separate disk controller for the disk.


Element Type: NT Process Set
Rule, Trend Variable, Threshold: TOT, Total Page Faults > 25
Window: 15/60 min
Severity: Minor
Message: Page faults too high
Description: The page fault rate is high for this process set.
Recommendations:
•
8.2 Server - Failure Profile

Table 27 Server - Failure

Element Type: Generic Server, Managewise Server, Insight Manager Server, BMC NT Server,
BMC Unix Server, Empire Unix Server, Empire NT Server
Rule, Trend Variable, Threshold: Availability
Window: 30 min
Severity: Critical
Message: Server Down
Description: The server went down and has come back up. This alarm will be cleared after the
system has been up for the window period.
Recommendations:



Element Type: Generic Server, Managewise Server, Insight Manager Server, BMC NT Server,
BMC Unix Server, Empire Unix Server, Empire NT Server
Rule, Trend Variable, Threshold: Reachability
Window: 30 min
Severity: Critical
Message: Server Unreachable
Description: A server is unreachable if eHealth gets no response to a series of pings of the server, or
if eHealth is unable to poll the device using SNMP. The server may be down, the
network path to the server may be down, or the SNMP agent on the server may not be
functioning. This alarm will be cleared after the server is reachable for the window
period.
Recommendations:
• If the Unreachable alarm is followed by a Down alarm, the server went down.
• Ping the server. If it is reachable via ping, the network path is now up.
• Examine a trend chart of latency to the server leading up to the failure. If the
latency was growing and approaching the ping timeout, then the network may be
so slow that eHealth is failing to reach the device within the timeout period. In
that case, you should solve the delay problem. You could also increase the ping
timeout.
• If ping fails, examine routers along the route to the device to see if any are down
or unreachable.
• Look at the polling status window to see if eHealth encountered problems in
polling the device and correct them.



Element Type: Unix Process Set, NT Process Set
Rule, Trend Variable, Threshold: Availability
Window: 30 min
Severity: Critical
Message: Process set down
Description: The process set is down if any of its critical processes are down.
Recommendations:
• Restart the application.





Element Type: Generic Server
Rule, Trend Variable, Threshold: TOT, Virtual Memory Utilization > 90%
Window: 15/60 min
Severity: Major
Message: Virtual Memory Usage too high
Description: The virtual memory utilization is too high. If the server should use all its virtual
memory, the server could crash, stop, or otherwise suffer a critical failure.
Recommendations:
• Increase the size of virtual memory available to the server.
• Lower the virtual memory used, either by removing applications, or lowering the
virtual memory used by some applications.
Element Type: BMC Unix Server, Empire Unix Server
Rule, Trend Variable, Threshold: TOT, Virtual Memory Utilization > 90%
Window: 15/60 min
Severity: Major
Message: Swap Space usage too high
Description: The swap space utilization is too high. If the server should use all its swap space, the
server could crash, stop, or otherwise suffer a critical failure.
Recommendations:
• Increase the swap space available to the server: either increase existing swap
space, or add new swap space on other disks.
• Lower the memory used, either by removing applications, or lowering the
memory used by some applications.


Element Type: Empire NT Server
Rule, Trend Variable, Threshold: TOT, Virtual Memory Utilization > 90%
Window: 15/60 min
Severity: Major
Message: Paging File usage too high
Description: The paging file utilization is too high. If the server should use all its page file space,
the server could crash, stop, or otherwise suffer a critical failure.
Recommendations:
• Increase the size of page files available to the server. Either increase the page file
size on existing disks, or add new page files on disks that do not have page files.
• Lower the memory used by the system, either by removing applications, or by
lowering the memory used by some applications.


Element Type: User Partition, System Partition
Rule, Trend Variable, Threshold: TOT, Inode Utilization > 95%
Window: 5/60 min
Severity: Major
Message: Running out of inodes
Description: Inodes are data structures on disk used in Unix file systems to hold a description of the
file. The number of inodes, and hence the maximum number of files that can be held in
a file system, is set when the file system is made. Running out of inodes will prevent
new files from being created on the file system.
Recommendations:
• Increase the number of inodes on this file system.
• Free inodes by deleting or moving small files to another disk.
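
To check inode usage by hand on a Unix system, the counts reported by the file system can be turned into a utilization percentage. The following Python sketch is an illustration only and is not how the agent collects the variable; the helper names are hypothetical, and `os.statvfs` is available on Unix platforms only.

```python
import os


def inode_utilization(total_inodes, free_inodes):
    """Percentage of inodes in use; 0 when the file system reports no inode count."""
    if total_inodes == 0:
        return 0.0
    return 100.0 * (total_inodes - free_inodes) / total_inodes


def inode_utilization_for_path(path):
    """Read the inode counts for the file system that holds `path` (Unix only)."""
    stats = os.statvfs(path)
    return inode_utilization(stats.f_files, stats.f_ffree)


def running_out_of_inodes(total_inodes, free_inodes, threshold=95.0):
    """Apply the 'Inode Utilization > 95%' threshold from the profile."""
    return inode_utilization(total_inodes, free_inodes) > threshold


example = inode_utilization(1000, 40)   # 960 of 1000 inodes are in use
```

A partition can trip this rule while still having plenty of free disk space, which is why the recommendations focus on small files rather than on large ones.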



Element Type: User Partition, System Partition
Rule, Trend Variable, Threshold: TOT, File Allocation Failures > 0
Window: 5/60 min
Severity: Major
Message: File allocation failures
Description: A user could not allocate a file on a file system.
Recommendations:
•


Element Type: User Partition, System Partition
Rule, Trend Variable, Threshold: TODT, Partition Utilization > (100% - 99.9th percentile)
Window: 5/60 min
Severity: Major
Message: Partition running out of space
Description: LiveExceptions measures the normal variation in disk space used over the past 6-week
baseline period. The 99.9th percentile of this variation is then used as the threshold for
the amount of free space that should be left on the disk.
Recommendations:
• Increase the amount of space available on this disk.
• Lower the disk space used by moving files to another disk.
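
The dynamic threshold for partition space can be sketched in a few lines of Python. The text does not say exactly how the "normal variation" is measured, so this sketch assumes it is the absolute change in Partition Utilization between consecutive baseline samples; that assumption, the nearest-rank percentile, and the function names are all hypothetical.

```python
import math


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile of a non-empty sample (assumed method)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def partition_threshold(utilization_samples, pct=99.9):
    """Utilization (in %) above which too little free space remains.

    The threshold is 100% minus the chosen percentile of the
    sample-to-sample variation seen during the baseline period.
    """
    changes = [abs(b - a) for a, b in zip(utilization_samples, utilization_samples[1:])]
    if not changes:
        raise ValueError("at least two baseline samples are required")
    return 100.0 - nearest_rank_percentile(changes, pct)


# Baseline utilization samples in percent; the largest swing is 5 points.
threshold = partition_threshold([50, 52, 51, 56, 55])
```

Under these assumptions a partition whose usage normally swings by up to 5 points raises the alarm once utilization passes 95%, so a quieter partition is allowed to run fuller than a volatile one.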
8.3 Server - Unusual Workload

Table 28 Server - Unusual Workload

Element Type: Empire Unix Server, Empire NT Server
Rule, Trend Variable, Threshold:
• (DFM, Processes above 99.9 percentile) AND (PFM, Processes above by 10%)
• (DFM, Processes below 99.9 percentile) AND (PFM, Processes below by 10%)
Window: 15/60 min
Severity: Warning
Message:
• Unusually high processes
• Unusually low processes
Description: The number of processes running on the system is unusually high or unusually low.
The number of processes must differ from the mean by at least 10% to raise the alarm. Refer
to section 12.1.
Recommendations:
• If the number of processes is unusually high, a new application may be running.
• If the number of processes is unusually low, an application may be down.


Element Type: Server CPU
Rule, Trend Variable, Threshold: (DFM, CPU System Utilization > 99.9 percentile) AND
(PFM, CPU System Utilization 10% above mean)
Window: 15/60 min
Severity: Warning
Message: Unusually high CPU system utilization
Description: The CPU System (or Kernel) Utilization is unusually high. System Utilization
measures the percent of time the CPU is busy performing system functions such as
I/O, scheduling, handling interrupts, or processing system calls. The CPU System (or
Kernel) Utilization must be at least 10% above the mean to raise the alarm. Refer to
section 12.1.
Recommendations:
• Systems that are I/O bound or busy processing interrupts often show a high CPU
System Utilization.


Element Type: Server CPU
Rule, Trend Variable, Threshold: (DFM, CPU Idle Utilization > 99.9 percentile) AND
(PFM, CPU Idle Utilization 10% above mean)
Window: 15/60 min
Severity: Warning
Message: Unusually high CPU IO wait time
Description: For Unix systems, the CPU I/O Wait Utilization is unusually high. The CPU I/O Wait
Utilization measures the percent of time the CPU is idle waiting for an I/O operation to
complete. The CPU I/O Wait Utilization must be at least 10% above the mean to raise
the alarm. Refer to section 12.1.
Recommendations:
• Unix systems that are busy waiting for I/O are wasting time.


9 RAS, Modem, ISDN, Modem Pool

These profiles are yet to be described. 

9.1 Remote Access - Delay Profile 



Table 29 Remote Access - Delay

Element Type: RAS
Rule, Trend Variable, Threshold: TOT, Modems Busy % > 95%
Window: 15/60 min
Severity: Minor
Message: Modems over used
Description:
Recommendations:
•


Element Type: Modem Pool
Rule, Trend Variable, Threshold: TOT, Modems Busy % > 95%
Window: 15/60 min
Severity: Minor
Message: Modems over used
Description:
Recommendations:
•


Element Type: RAS
Rule, Trend Variable, Threshold: TOT, Discarded Frames % > 5%
Window: 15/60 min
Severity: Minor
Message: Too many discards on dial-in connections
Description:
Recommendations:
•


Element Type: RAS CPU
Rule, Trend Variable, Threshold: TOT, CPU Utilization % > 60%
Window: 15/60 min
Severity: Minor
Message: Modems over used
Description:
Recommendations:
•
9.2 Remote Access - Failure Profile

Table 30 Remote Access - Failure

Element Type: RAS
Rule, Trend Variable, Threshold: Availability
Window: 30 min
Severity: Critical
Message: Remote Access Server Down
Description:
Recommendations:
•


Element Type: RAS
Rule, Trend Variable, Threshold: Reachability
Window: 30 min
Severity: Critical
Message: Remote Access Server Unreachable
Description:
Recommendations:
•


Element Type: RAS
Rule, Trend Variable, Threshold: TOT, Free Memory < 2000000
Window: 5/60 min
Severity: Major
Message: Free memory too low
Description:
Recommendations:
•


Element Type: RAS
Rule, Trend Variable, Threshold: TOT, Modem Errors > 0.01 errors/sec
Window: 15/60 min
Severity: Major
Message: Too many modem errors
Description:
Recommendations:
•


Element Type: Ethernet
Rule, Trend Variable, Threshold: TOT, Retrains > 0.05 retrains/sec
Window: 15/60 min
Severity: Major
Message: Too many retrains
Description:
Recommendations:
•


Element Type: Ethernet
Rule, Trend Variable, Threshold: TOT, Frame Errors % > 1%
Window: 15/60 min
Severity: Major
Message: Too many frame errors
Description:
Recommendations:
•
9.3 Remote Access - Unusual Workload Profile

Table 31 Remote Access - Unusual Workload

Element Type: RAS
Rule, Trend Variable, Threshold:
• (DFM, Bits In above 99.9th percentile) AND (TOT, Connect Time % > 25%)
• (DFM, Bits In above 99.9th percentile) AND (TOT, Connect Time % > 25%)
Window: 15/60 min
Severity: Warning
Message:
• Unusually high bits in
• Unusually high bits out
Description: The number of Bits In or Out of the modems is unusually high. The second clause
limits this alarm to conditions where the modems are connected more than 25% of the
time.
Recommendations:
•


Element Type: RAS
Rule, Trend Variable, Threshold: (DFM, Connect Time % above 99th percentile) AND
(TOT, Connect Time % > 25%)
Window: 15/60 min
Severity: Warning
Message: Unusually high connect time %
Description:
Recommendations:
•


Element Type: RAS
Rule, Trend Variable, Threshold: (DFM, Connections above 99th percentile) AND
(PFM, Connections 50% above mean)
Window: 15/60 min
Severity: Warning
Message: Unusually high connections
Description:
Recommendations:
•


Element Type: RAS
Rule, Trend Variable, Threshold: (DFM, Memory Utilization above 99th percentile) AND
(PFM, Memory Utilization 10% above mean)
Window: 15/60 min
Severity: Warning
Message: Unusually high memory utilization
Description:
Recommendations:
•


Element Type: RAS CPU
Rule, Trend Variable, Threshold: (DFM, CPU Utilization above 99th percentile) AND
(PFM, CPU Utilization 25% above mean)
Window: 15/60 min
Severity: Warning
Message: Unusually high CPU utilization
Description:
Recommendations:
•


Element Type: Modem Pool
Rule, Trend Variable, Threshold: (DFM, Connect Time % above 99th percentile) AND
(TOT, Connect Time % > 25%)
Window: 15/60 min
Severity: Warning
Message: Unusually high connect time %
Description:
Recommendations:
•
10 Response Profiles

The Response - Delay profile, see Table 32, covers performance problems related to delay as well as failures. It 
raises alarms when the service level agreement (SLA) is violated. 

Response time can be measured in two ways: 

• By using a test generator agent, either the sysEdge Service Response module (SR) or the Cisco Service Assurance
Agent (SAA), that generates transaction attempts at regular intervals.

• By using an observational agent, the FirstSense agent (FS), which monitors user transactions and measures the 
actual response time the user experiences. 

In either case, the agent measures response for a Response Path, from a Source to a Destination for a particular 
Application or Protocol. 

For each response path monitored with this profile, the path's Response Limit should be set to the maximum service
response time allowed by the SLA for this application or protocol. For example, suppose the SLA states that the
maximum response time for a DNS query should be 1 second. The Response Limit for all the DNS paths should then be
set to 1000 milliseconds (= 1 second). You can set the Response Limit using the Path Manager in the Poller
Configuration in the Network Health Console.
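
As a worked illustration of the Response/Limit variable used by the profile, the Python sketch below expresses a measured response time as a percentage of the path's Response Limit. The function names are hypothetical; only the 1000-millisecond DNS limit and the 100% threshold come from the text.

```python
def response_over_limit_pct(response_ms, limit_ms):
    """Measured response time as a percentage of the path's Response Limit."""
    if limit_ms <= 0:
        raise ValueError("the Response Limit must be set to a positive value")
    return 100.0 * response_ms / limit_ms


def violates_sla(response_ms, limit_ms):
    """True when Response/Limit exceeds 100%, i.e. the SLA limit is exceeded."""
    return response_over_limit_pct(response_ms, limit_ms) > 100.0


DNS_LIMIT_MS = 1000                                  # 1 second, from the SLA example
slow_query = violates_sla(1250, DNS_LIMIT_MS)        # 125% of the limit
normal_query = violates_sla(800, DNS_LIMIT_MS)       # 80% of the limit
```

Because the variable is a ratio, one rule can cover paths with very different limits, provided each path's Response Limit has actually been set.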
10.1 Response - Delay Profile

Table 32 Response - Delay

Element Type: Response Path, Application Response Path, Jitter Response Path,
FirstSense Response Path, Empire Service Response Path
Rule, Trend Variable, Threshold: TOT, Response/Limit > 100%
Window: 15/60 min
Severity: Major
Message: Response over Limit
Description: The Response time is greater than the service level agreement allows for the
application or protocol.
Recommendations:
Generic recommendations:

• Check to see if the response limit has been set and that the value is reasonable.
• To diagnose why response time might be slow for the path, drill down to an AAG
report.
• If the problem appears to be a slow network route from source to destination,
diagnose and correct the problem as described in 10.2.
• If the problem appears to be a slow destination (server) for an application path,
look for any alarms or exceptions from the Server - Delay profile for the
destination server.

Specific recommendations depend on the particular kind of path.

• For Cisco SAA or sysEdge SR paths measuring network protocols such as Ping,
UDP Echo, or Jitter Tests:
  • Refer to section 10.2.

• For Cisco SAA paths measuring application protocols, such as HTTP, FTP, or 
email: 

• Check the response time for a network level path that parallels the application 
path, i.e., that has the same source and destination, but measures network 
delay directly using a protocol like Ping or UDP Echo. 

• If that path is slow, suspect the delay is in the network. Refer to section 10.2. 

• If the network is not slow, check the server: 

• See if any alarms are active on the server. 

• The Response Destination AAG should show if the problem is common 
to all paths (test sources) or specific to this path. This report can be run 
from the Path AAG. 

• The Server AAG report for the server should pinpoint any problems 
within the server. This report can be run from the Response Destination 
AAG. 

• If neither the network nor the server is causing the problem, the delay may be 
within the source router running the test.

• See if any alarms are active on the source router. 

• The Response Source AAG should show if the problem is common to all 
paths whose source is the router or specific to this path. This report can be
run from the Path AAG. 

• The Router AAG report for the source system should pinpoint any 
problems within the router, in particular, look at the CPU utilization for 
the router. This report can be run from the Response Source AAG. 

• For sysEdge SR paths measuring application protocols, such as HTTP, FTP, or 
email: 

• Drill down to an AAG report to determine if the bulk of response time is in the
DNS Lookup Time, TCP Connect Time, or the actual Transaction Time.

• If the DNS Lookup Time is long: 

• Check to see if the DNS Server is working properly. 

• Check the response time for a path from this source SR agent to the DNS 
server to see if DNS is slow. 

• If the TCP Connect Time is long, the delay may be in the network. 

• Check the response time for a network level path that parallels the application 
path, i.e. has the same source and destination, but measures network delay 
directly, using a protocol like Ping or UDP Echo. 

• If that path is slow, suspect the delay is in the network. Refer to 10.2. 

• If the network is not slow, check the server: 

• See if any alarms are active on the server. 

• The Response Destination AAG should show if the problem is common 
to all paths (test sources) or specific to this path. This report can be run 
from the Path AAG. 

• The Server AAG report for the server should pinpoint any problems 
within the server. This report can be run from the Response Destination 
AAG. 

• If neither the network nor the server is causing the problem, the delay may be 
within the source system running the test. 

• See if any alarms are active on the source system. 

• The Response Source AAG should show if the problem is common to all 
paths (test destinations) or specific to this path. This report can be run 
from the Path AAG. 

• The Server AAG report for the source system should pinpoint any 
problems within the system. This report can be run from the Response
Source AAG.

• For Cisco SAA or sysEdge SR paths measuring network services such as DNS: 

• Check the response time for a network level path that parallels the application 
path, i.e. has the same source and destination, but measures network delay 
directly, using a protocol like Ping or UDP Echo. 

• If that path is slow, suspect the delay is in the network. Refer to 10.2. 

• If the network is not slow, check the server: 

• See if any alarms are active on the server. 

• The Response Destination AAG should show if the problem is common 
to all paths (test sources) or specific to this path. This report can be run 
from the Path AAG. 

• The Server AAG report for the server should pinpoint any problems 
within the server. This report can be run from the Response Destination 
AAG. 

• If neither the network nor the server is causing the problem, the delay may be 
within the source system running the test. 

• See if any alarms are active on the source system. 

• The Response Source AAG should show if the problem is common to all 
paths (test destinations) or specific to this path. This report can be run 
from the Path AAG. 

• The Server AAG report for the source system should pinpoint any 
problems within the system. This report can be run from the Response
Source AAG.

• For FS paths measuring application transactions such as SAP, Oracle, or 
Exchange: 

• Drill down to an AAG report to determine if the bulk of response time is in the
Client, the Server, or the Network. 

• If the network is slow, refer to 10.2. 

• If the server response time is slow:

• See if any alarms are active on the server. 

• The Response Destination AAG should show if the problem is common 
to all paths (source clients) which use this server, or specific to this 
client. This report can be run from the Path AAG. 

• The Server AAG report for the server should pinpoint any problems 
within the server. This report can be run from the Response Destination 
AAG. 

• If client response time is slow: 

• See if any alarms are active on the source system. 

• The Response Source AAG should show if the problem is common to all 
paths (test destinations) or specific to this path. This report can be run 
from the Path AAG. 

• If the client system is an NT system, the Empire sysEdge agent could be 
run on that system. A Server AAG report for the client system should 
pinpoint any problems within the system. This report can be run from the 
Response Source AAG. 



Response Path, Application Response Path, Response Path with Jitter, Empire Service Response Path TOT, Attempts < 0.001% 15/60 min Major

Message: No attempts made

Description: This rule applies only to generated test transactions. The agent made no transaction
attempts. This often means a problem with the agent, or a problem with the particular
test configuration.

Recommendations:

• See if the agent is operating.

• Examine any logs generated by the agent or the eHealth Statistics Poller to see if
any error messages were generated.

Response Path, Application Response Path, Jitter Response Path, Empire Service Response Path TOT, Failed Attempts > 20% 15/60 min Major

Message: Test attempts or transactions failed

Description: This rule applies only to generated test transactions. Some of the test transaction
attempts failed. An attempt might fail because the destination is not available or
because the transaction took longer than the timeout value defined for the path.

Recommendations:

• Drill down to an AAG report for the path. Look to see if the failed transactions
happen when response is slow. In particular, look at the Maximum Response. If
response is slow, diagnose the problem as described above under the Response
over Limit alarm.

• If the Destination Unreachable alarm is also active then the problem is likely to be
related to the server, or the network path from source to destination is down.

Response Path, Application Response Path, Jitter Response Path, Empire Service Response Path TOT, Failed Attempts > 99.99% 15/60 min Major

Message: Destination unreachable

Description: This rule applies only to generated test transactions. All of the test transaction attempts
failed. An attempt might fail because the destination is not available or because the
transaction took longer than the timeout value defined for the path.

Recommendations:

• Drill down to an AAG report for the path. Look to see if the failed transactions
happen when response is slow. In particular, look at the Maximum Response. If
response is slow, diagnose the problem as described above under the Response
over Limit alarm.

• If this is an application test (for example, DNS, HTTP, or email) determine if the
network path is down, or if the server is down.

• To check the network path, examine a parallel path, one with the same source
and destination, but for a network protocol such as a Ping or UDP Echo test.

• To check the destination, run an AAG for the destination, and see if all the
paths are unable to reach the destination. If so, the problem is likely within
the server.

Jitter Response Path TOT, Jitter > 10 msec 15/60 min Major

Message: Too much jitter

Description: The jitter measured on this path is too large. Jitter can have a severe impact on real
time voice or video communications.

Recommendations:

• Jitter is often caused by variation in queueing delays in routers and switches along
the route, or in the source or destination systems themselves.

• Jitter can be controlled by giving voice and video traffic priority over data traffic
in the queueing discipline used in routers and switches.
Response Path, Application Response Path, Jitter Response Path, Service Response Path DFM, Response above 95 percentile 15/60 min Warning

Message: Unusually slow response

Description: The response time for this path is unusually slow.

Recommendations:

• If the response time is too slow, see the recommendations for Response over Limit
above.

• Check to see if the network or destination is handling an unusually high workload.

Response Path, Application Response Path, Jitter Response Path, Service Response Path 60/120 min Warning

Rule, Trend Variable, Threshold:

• (DFM, Minimum Response above 95 percentile) AND (PFM, Minimum Response
20% above mean)

• (DFM, Minimum Response below 95 percentile) AND (PFM, Minimum Response
20% below mean)

Message:

• Increased minimum response - possible route change

• Decreased minimum response - possible route change

Description: The minimum response has changed. It has either increased or decreased. Minimum
response measures the response time seen when other traffic or work on the server is
minimized. It is generally a stable measure of the "speed of light delay" from source to
destination and back again on network protocol tests. However if the route that packets
follow between source and destination changes, the minimum round trip delay may
also change.

This alarm is most useful on network layer tests, such as Ping or UDP Echo tests. For
application tests, reconfiguration or other changes in the application server can cause
this alarm.

Recommendations:

• Check to see if the route has changed.

• If the path is between adjacent routers, and the routers are connected by a Frame
Relay Circuit or ATM Channel, check with the service provider to see if the

Empire Service Response Path DFM, DNS Lookup Time above 99.9 percentile 15/60 min Warning

Message: Unusually slow DNS lookup time

Description: The time to lookup the DNS name and translate it into an address was unusually slow.

Recommendations:

• Check the DNS server to see if there is a problem with it.

Empire Service Response Path DFM, TCP Connect Time above 99.9 percentile 15/60 min Warning

Message: Unusually slow TCP Connect time

Description: The time taken to establish the TCP connection was unusually slow.

Recommendations:

• Check the network from client to server to see if it is slow.

Empire Service Response Path DFM, Transaction Time above 99.9 percentile 15/60 min Warning

Message: Unusually slow transaction time

Description: The time to perform the actual transaction (once the TCP connection was established)
is unusually slow.

Recommendations:

• Check the application server to see if it is slow. 
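
The fixed (TOT) thresholds in Table 32 for failed attempts and jitter can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the `PathSample` type and `threshold_alarms` function are hypothetical, the "Attempts < 0.001%" rule is approximated as "no attempts at all", and the 15/60 min window is ignored, so the sketch evaluates a single sample rather than a condition sustained over time.

```python
from dataclasses import dataclass


@dataclass
class PathSample:
    attempts: int            # transaction attempts made by the test agent
    failed: int              # attempts that failed or timed out
    jitter_ms: float = 0.0   # measured jitter, relevant to jitter paths only


def threshold_alarms(sample: PathSample) -> list[str]:
    """Return the Table 32 alarm messages whose thresholds the sample crosses."""
    if sample.attempts == 0:
        # Assumption: "Attempts < 0.001%" is treated here as "no attempts at all".
        return ["No attempts made"]
    alarms = []
    failed_pct = 100.0 * sample.failed / sample.attempts
    if failed_pct > 20.0:
        alarms.append("Test attempts or transactions failed")
    if failed_pct > 99.99:
        alarms.append("Destination unreachable")
    if sample.jitter_ms > 10.0:
        alarms.append("Too much jitter")
    return alarms


some_failed = threshold_alarms(PathSample(attempts=20, failed=5))
all_failed = threshold_alarms(PathSample(attempts=10, failed=10))
```

Because every sample that fails all attempts also exceeds the 20% threshold, the second call reports both failure messages, which matches the recommendation to check whether the Destination Unreachable alarm is also active.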
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10.2 Diagnosing a slow network path 

A slow network path can be caused by delays in any link, switch, or router along the route from the source to the 
destination, or on the route from the destination back to the source. Alarms from the delay profiles applied to the 
LAN and WAN links, routers, and switches along the route should identify any delay problems caused by these 
network components. 

To determine the route from source to destination, you can log into the source system, and perform a traceroute to 
the destination IP address. This will at least identify the routers along the path. With a basic knowledge of the 
network topology, in particular a knowledge of the WAN links between the routers, you should be able to identify 
the major WAN links the route traverses. Some switches (switches operating at layer 2) will not be seen by 
traceroute. 
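
As a minimal sketch of the traceroute step, the following Python function pulls the responding router addresses out of Unix-style traceroute output that has already been captured as text. The function name and the sample output are illustrative assumptions; the sketch expects the common `name (address)` hop format and will not match numeric-only output.

```python
import re

# Matches lines such as " 1  gw.example.net (192.0.2.1)  1.2 ms ..."
HOP_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+\((\d+\.\d+\.\d+\.\d+)\)")


def router_hops(traceroute_output: str) -> list[tuple[int, str]]:
    """Extract (hop number, IP address) pairs from traceroute output.

    Hops that did not answer (shown as "* * *") are skipped.
    """
    hops = []
    for line in traceroute_output.splitlines():
        match = HOP_RE.match(line)
        if match:
            hops.append((int(match.group(1)), match.group(3)))
    return hops


# Illustrative output only; the addresses come from documentation ranges.
sample = """traceroute to 198.51.100.20 (198.51.100.20), 30 hops max, 40 byte packets
 1  gw.example.net (192.0.2.1)  1.2 ms  1.0 ms  0.9 ms
 2  * * *
 3  198.51.100.20 (198.51.100.20)  24.8 ms  25.1 ms  24.7 ms
"""
hops = router_hops(sample)
```

The resulting address list can then be matched against the routers and WAN links being polled, keeping in mind that layer 2 switches never appear in it.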
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11 Host Latency Profiles 

Two profiles are provided to detect latency problems. These profiles can be applied to any host, router, switch, 
server, or RAS element. The two profiles are: 

• Host - Unusual Latency, Table 33. 

• Host - Latency 2 second limit, Table 34. 

The two profiles are designed to work together. The Unusual Latency profile adapts the threshold based on history, 
and thus does a good job of detecting problems that suddenly appear. However, problems that develop slowly over 
time, or latencies that 

For most users, the latency profile can best be applied to devices, that is, Servers, Routers, Switches, and RAS. 
Latency measures the time to ping the IP address of the host's agent. The same ping latency is used for all the 
elements with that agent address. 

Customers who are only monitoring LAN and WAN elements can create a custom profile which measures the 
latency to a LAN or WAN element. This should be applied selectively to a few LAN/WAN elements, as many of 
them share the same agent IP address. 

Customers using alternate latency to measure the delay over a LAN/WAN link to the other end should apply a 
custom profile using a DFM rule to detect when the link latency changes. 

Table 33 Host - Unusual Latency 

Element Type Rule, Trend Variable, Threshold Window Severity 

Any Host DFM, Latency above 97.7 percentile 15/60 min Warning

Message: Latency to host unusually high

Description: The network delay from the eHealth poller to the host is unusually high.

Recommendations:
• Drill down to a Trend report of Latency to see the normal range and how the
current value compares to it.
• If the latency is too high, determine why and correct it as described in section 
10.2. 



Table 34 Host - Latency 2 second limit

Element Type Rule, Trend Variable, Threshold Window Severity

Any Host TOT, Latency > 2000 msec 15/60 min Warning

Message: Latency to host too high

Description: The network delay from the eHealth poller to the host is too high. The value of 2
seconds (2000 milliseconds) depends on your particular network. Depending on the
size and delays typically encountered in your network, you may increase or decrease
this threshold.

Recommendations:
• Drill down to a Trend report of Latency to see the normal range, and how the
current value compares to it.
• If the latency is too high, determine why and correct it as described in section 
10.2. 
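
To see why the two host latency profiles complement each other, consider the following Python sketch. It is not the eHealth implementation: the function name is hypothetical, and the way the 97.7 percentile is turned into a threshold (a normal model, mean plus z times the standard deviation of recent history) is an assumption made only for illustration.

```python
from statistics import NormalDist, mean, pstdev


def host_latency_alarms(latency_ms, history_ms, percentile=97.7, limit_ms=2000.0):
    """Evaluate both host latency profiles for one latency sample."""
    alarms = []
    # Assumption: the percentile is mapped to a threshold with a normal model,
    # threshold = mean + z * standard deviation of the recent history.
    z = NormalDist().inv_cdf(percentile / 100.0)
    adaptive_limit = mean(history_ms) + z * pstdev(history_ms)
    if latency_ms > adaptive_limit:
        alarms.append("Latency to host unusually high")
    if latency_ms > limit_ms:
        alarms.append("Latency to host too high")
    return alarms


# A sudden jump on a normally fast host trips only the adaptive profile.
sudden = host_latency_alarms(150.0, [20.0, 22.0, 19.0, 21.0, 20.0, 18.0])
# After latency has crept up, the history has adapted; only the fixed limit fires.
crept = host_latency_alarms(2500.0, [2300.0, 2450.0, 2600.0, 2500.0, 2400.0, 2550.0])
```

In the second call the history has already absorbed the slow increase, so the adaptive threshold no longer flags it and the fixed 2 second limit is what raises the alarm.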



6-Jun-00 



Concord Communications, Inc. 






Live Exceptions Profiles V1.9 



12 Notes 

This section describes notes that apply to many rules. 

12.1 Compound Unusual Value Alarm Rules 

Alarm rules based on Deviation from Mean detect cases where the value is unusual. However, experience has shown 
that many cases where the value is "unusual" are also cases where the standard deviation is very small or the 
workload (traffic) is low. In such cases, the mean and standard deviation of the variable are very small, and the
normal range is very narrow. Any change from the mean is seen as being unusual, even though the deviation is 
trivial. 

To correct this, many unusual workload rules compound the basic Deviation from Mean rule with a Percent from 
Mean or an Absolute from Mean. Which of these is used depends on the particular case: 

• If the standard deviation is small and a reasonable minimum deviation can be identified, then an Absolute from
Mean can be used to filter out trivial deviations from normal. For example, say we want to detect if the number
of users logged in to a Unix system is unusually high. We might use the rule (DFM users above the 99.9
percentile) AND (AFM above 2). The first clause of the rule detects cases during the middle of the day when
the normal number of users is 40, and the value varies from 20-60 users. The second clause covers the case
where there are always four users logged on in the middle of the night, and the standard deviation is very small,
less than 1. On a night when there are five users logged in, we do not want to raise an alarm. By adjusting the
absolute range in the second clause, we can filter out more (or less) of these trivial changes.

• If no absolute range can be determined, we might add a filter clause using Percent from Mean to filter out trivial 
deviations. For example, say we want to detect if the number of page faults is unusually high. We could use the 
rule (DFM, Page Faults above 99 percentile) AND (PFM, Page Faults 100% above mean). The second clause 
filters out cases where page faults are less than twice the mean.

• Some variables may change wildly when the traffic is low. For example, if a WAN link is carrying only 10 
frames per second, then each error per second corresponds to an additional 10% error percentage. To detect 
unusual values in percentages, we could use the rule (DFM, Errors % > 95 percentile) AND (TOT, Frames > 
100). The second clause discards cases where the frames per second is less than 100, which ensures that there 
are enough frames considered to get a reasonably accurate value for Error %. 
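
The compound rules above can be sketched in Python as follows. The helper names are hypothetical, and the Deviation from Mean clause is modelled with a normal distribution built from the mean and standard deviation of the history, which is an assumption about how the percentile is evaluated rather than a description of the product's internals.

```python
from statistics import NormalDist, mean, pstdev


def dfm_above(value, history, percentile):
    """Deviation from Mean clause: value above the given percentile of history.

    Assumption: the percentile is evaluated with a normal model built from the
    mean and standard deviation of the history.
    """
    z = NormalDist().inv_cdf(percentile / 100.0)
    return value > mean(history) + z * pstdev(history)


def afm_above(value, history, amount):
    """Absolute from Mean clause: value exceeds the mean by more than `amount`."""
    return value - mean(history) > amount


def pfm_above(value, history, percent):
    """Percent from Mean clause: value exceeds the mean by more than `percent`."""
    return value > mean(history) * (1.0 + percent / 100.0)


def unusual_users(users, history):
    # (DFM users above the 99.9 percentile) AND (AFM above 2)
    return dfm_above(users, history, 99.9) and afm_above(users, history, 2)


def unusual_page_faults(faults, history):
    # (DFM, Page Faults above 99 percentile) AND (PFM, Page Faults 100% above mean)
    return dfm_above(faults, history, 99) and pfm_above(faults, history, 100)


night_alarm = unusual_users(5, [4, 4, 4, 4, 4, 4, 4, 4])        # trivial change, filtered out
day_alarm = unusual_users(90, [40, 20, 60, 35, 45, 30, 50, 40])  # well outside the normal range
```

With a constant night-time history the standard deviation is zero, so the DFM clause alone would flag a fifth user; the AFM clause is what suppresses that alarm.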

12.2 Statistics, Percentiles, and Standard Deviations 

TBS 

12.3 Drilldowns 



12.4 Availability and Reachability Alarms for Hosts 



12.5 Utilization, Queueing, and Delay 
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